Optimizers

Riemann provides a variety of optimization algorithms through the riemann.optim package. These optimizers are used to update the parameters of neural networks during training.

Optimizer Basics

This section covers the fundamental concepts and usage patterns for optimizers in Riemann.

All optimizers in Riemann inherit from the optim.Optimizer class. To use an optimizer, you need to:

Create an optimizer instance with the parameters to optimize
Define a loss function
Zero the gradients
Compute the loss and call backward()
Call the optimizer’s step() method

import riemann as rm
import riemann.nn as nn
import riemann.optim as optim

# Create a simple model
model = nn.Linear(10, 1)

# Create an optimizer
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Define loss function
loss_fn = nn.MSELoss()

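# Example data (random values, for illustration only)
inputs = rm.randn(100, 10)
targets = rm.randn(100, 1)
num_epochs = 100

# Training loop
for epoch in range(num_epochs):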
    # Zero gradients
    optimizer.zero_grad()

    # Forward pass
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)

    # Backward pass
    loss.backward()

    # Update parameters
    optimizer.step()

GD (Gradient Descent)

GD is the most basic optimization algorithm that updates parameters in the direction of the negative gradient.
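
The update rule described above can be sketched in plain Python, independent of Riemann. The function name `gradient_descent` and the quadratic test objective are illustrative assumptions, not part of the library.

```python
def gradient_descent(grad_fn, theta, lr=0.1, steps=100):
    """Repeatedly step against the gradient: theta <- theta - lr * grad."""
    for _ in range(steps):
        theta = theta - lr * grad_fn(theta)
    return theta

# Minimize f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta_min = gradient_descent(lambda t: 2.0 * (t - 3.0), theta=0.0)
print(theta_min)  # approaches the minimizer at 3.0
```

Each step shrinks the distance to the minimum by a constant factor here, which is why a fixed learning rate is enough for this simple objective.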

Basic GD

import riemann as rm
import riemann.nn as nn
import riemann.optim as optim

# Create model
model = nn.Linear(10, 1)

# Create GD optimizer
optimizer = optim.GD(model.parameters(), lr=0.01)

# Training loop
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()

GD with Weight Decay

Weight decay (L2 regularization) can be added to prevent overfitting.

import riemann as rm
import riemann.optim as optim

# Create GD optimizer with weight decay
optimizer = optim.GD(model.parameters(), lr=0.01, weight_decay=1e-4)

# Training loop
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()

SGD (Stochastic Gradient Descent)

SGD is a variant of gradient descent that updates parameters using a subset of data at each iteration. It is one of the most fundamental optimization algorithms in machine learning.

Algorithm Explanation:

Compute gradients on a mini-batch: ∇θL(θ)
Apply weight decay: θ = θ * (1 - η * weight_decay)
Update parameters: θ = θ - η * ∇θL(θ)
Optional: Apply momentum and/or Nesterov momentum
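
The steps above can be sketched on plain Python lists. This is a minimal illustration, not Riemann's implementation: the helper name `sgd_step` is hypothetical, and the velocity form `buf = momentum * buf + grad` is an assumption that mirrors the momentum buffer used in the custom optimizer example later in this document.

```python
def sgd_step(theta, grad, buf, lr=0.01, momentum=0.0, weight_decay=0.0):
    """One SGD update on lists of floats; returns (new_theta, new_buf)."""
    new_theta, new_buf = [], []
    for t, g, b in zip(theta, grad, buf):
        t = t * (1.0 - lr * weight_decay)  # weight decay shrinks the parameter
        b = momentum * b + g               # velocity; equals g when momentum == 0
        new_theta.append(t - lr * b)       # move against the (smoothed) gradient
        new_buf.append(b)
    return new_theta, new_buf

# Two steps with a constant gradient show the velocity building up.
theta, buf = [0.0], [0.0]
theta, buf = sgd_step(theta, [1.0], buf, lr=0.1, momentum=0.9)
theta, buf = sgd_step(theta, [1.0], buf, lr=0.1, momentum=0.9)
print(theta, buf)
```

With a constant gradient the second step is larger than the first, which is the acceleration effect momentum is meant to provide.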

Advantages:

Simple and computationally efficient
Straightforward implementation
Only one hyperparameter to tune in its basic form (the learning rate)
Effective for large datasets when used with momentum

Suitable Scenarios:

Training large models with large datasets
When computational resources are limited
As a baseline for comparing other optimization algorithms
When fine-tuning models with careful learning rate scheduling

Basic SGD

import riemann as rm
import riemann.nn as nn
import riemann.optim as optim

# Create model
model = nn.Linear(10, 1)

# Create SGD optimizer
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Training loop
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()

SGD with Momentum

Momentum helps accelerate SGD in the relevant direction and dampens oscillations.

import riemann as rm
import riemann.optim as optim

# Create SGD optimizer with momentum
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Training loop
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()

SGD with Nesterov Momentum

Nesterov momentum is a variant of momentum that can provide better convergence by looking ahead in the direction of the momentum vector.

import riemann as rm
import riemann.optim as optim

# Create SGD optimizer with Nesterov momentum
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)

# Training loop
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()

SGD with Closure Parameter

The SGD optimizer supports a closure parameter, which is a function that reevaluates the model and returns the loss. This is useful for algorithms that need to compute the loss multiple times per optimization step, or for custom optimization logic.

import riemann as rm
import riemann.optim as optim

# Create SGD optimizer
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Define closure function
def closure():
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    return loss

# Training loop using closure
for epoch in range(num_epochs):
    loss = optimizer.step(closure)
    print(f'Epoch {epoch}, Loss: {loss.item():.4f}')

When to Use Closure with SGD

The closure parameter is particularly useful in scenarios such as:

Implementing custom optimization logic that requires multiple forward/backward passes
Using SGD with algorithms that need to compute loss gradients multiple times
Creating consistent interfaces across different optimizers (e.g., LBFGS always requires a closure)
Debugging optimization issues by inspecting loss values at different stages

Adam (Adaptive Moment Estimation)

Adam combines the best properties of AdaGrad and RMSProp algorithms to provide an optimization algorithm that can handle sparse gradients on noisy problems. It maintains both a moving average of the gradients (momentum) and a moving average of the squared gradients (adaptive learning rate).

Algorithm Explanation:

Apply weight decay to the gradient (L2 regularization): g = ∇θL(θ) + weight_decay * θ
Compute first moment estimate: m = β1*m + (1-β1)*g
Compute second moment estimate: v = β2*v + (1-β2)*g²
Apply bias correction: m̂ = m/(1-β1^t), v̂ = v/(1-β2^t)
Update parameters: θ = θ - η*m̂/(√v̂ + ε)
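
The moment estimates, bias correction, and parameter update can be sketched for a single scalar parameter as follows. This is an illustration only: the function name `adam_step` is hypothetical, the default `eps=1e-8` is an assumption, and weight decay is left out to keep the core update visible.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter at step t (t starts at 1)."""
    m = beta1 * m + (1.0 - beta1) * grad          # first moment (mean of gradients)
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second moment (mean of squares)
    m_hat = m / (1.0 - beta1 ** t)                # bias correction for zero init
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = 1.0, 0.0, 0.0
theta, m, v = adam_step(theta, grad=0.5, m=m, v=v, t=1)
print(theta)  # the first step has magnitude close to lr, whatever the gradient scale
```

Because the update divides the first moment by the square root of the second, the very first step moves the parameter by roughly lr regardless of how large the gradient is.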

Advantages:

Adaptive learning rate per parameter
Combines benefits of momentum and adaptive learning rates
Well-suited for sparse gradients and noisy problems
Generally works well without extensive hyperparameter tuning

Suitable Scenarios:

Training deep neural networks with many parameters
Problems with sparse gradients (e.g., NLP tasks)
Large-scale machine learning problems
When quick convergence is desired
As a default optimizer for most deep learning tasks

Basic Adam

import riemann as rm
import riemann.optim as optim

# Create Adam optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()

Adam with Weight Decay

Weight decay (L2 regularization) can be added to prevent overfitting.

import riemann as rm
import riemann.optim as optim

# Create Adam optimizer with weight decay
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)

# Training loop
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()

Adagrad

AdaGrad adapts the learning rate to the parameters, performing smaller updates for parameters associated with frequently occurring features. It maintains a per-parameter learning rate that is scaled based on the historical sum of squared gradients.

Algorithm Explanation:

Compute gradients: ∇θL(θ)
Accumulate squared gradients: G = G + (∇θL(θ))²
Apply weight decay: θ = θ * (1 - η * weight_decay)
Update parameters: θ = θ - η*∇θL(θ)/(√G + ε)
Optional: Apply learning rate decay
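
A scalar sketch of the accumulation and update steps shows how the effective step size shrinks as squared gradients pile up. The name `adagrad_step` and the default `eps=1e-10` are assumptions made for this illustration; weight decay and learning rate decay are omitted.

```python
import math

def adagrad_step(theta, grad, G, lr=0.01, eps=1e-10):
    """One AdaGrad update for a scalar parameter; G is the running sum of squares."""
    G = G + grad * grad
    theta = theta - lr * grad / (math.sqrt(G) + eps)
    return theta, G

theta, G = 0.0, 0.0
step_sizes = []
for _ in range(3):
    new_theta, G = adagrad_step(theta, grad=1.0, G=G, lr=0.1)
    step_sizes.append(theta - new_theta)  # how far this update moved the parameter
    theta = new_theta
print(step_sizes)  # each step is smaller than the previous one
```

With a constant gradient the step sizes fall off as 1/√k, which is why frequently updated parameters end up with smaller effective learning rates.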

Advantages:

Adaptive learning rate per parameter
Automatically scales learning rates for different features
Well-suited for sparse data where some features are infrequent
No need to manually tune learning rate for different parameters

Suitable Scenarios:

Training on sparse datasets (e.g., text classification)
Problems with features of varying frequencies
When you want learning rates to automatically adapt to feature importance
As a baseline for adaptive learning rate methods

Basic Adagrad

import riemann as rm
import riemann.optim as optim

# Create AdaGrad optimizer
optimizer = optim.Adagrad(model.parameters(), lr=0.01)

# Training loop
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()

LBFGS

LBFGS is a quasi-Newton method that approximates the BFGS algorithm using a limited amount of memory. It keeps a short history of recent parameter and gradient changes to approximate the inverse Hessian, which provides curvature information at a much lower memory cost than full BFGS or Newton methods.

Algorithm Explanation:

Compute gradients using the closure function
Maintain a history of recent (position, gradient) pairs
Approximate the inverse Hessian matrix using this history
Compute search direction based on the approximate inverse Hessian
Perform line search to find optimal step size
Update parameters and repeat
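
The search-direction step is usually computed with the two-loop recursion from the general L-BFGS literature. The sketch below shows that recursion on plain Python lists; it is not taken from Riemann's source, the names `dot` and `lbfgs_direction` are hypothetical, and the line search is omitted.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def lbfgs_direction(grad, s_history, y_history):
    """Return -H * grad, where H approximates the inverse Hessian.

    s_history holds parameter differences and y_history the matching
    gradient differences, both ordered oldest first.
    """
    q = list(grad)
    saved = []
    # First loop: walk the history from newest to oldest.
    for s, y in zip(reversed(s_history), reversed(y_history)):
        rho = 1.0 / dot(y, s)
        a = rho * dot(s, q)
        q = [qi - a * yi for qi, yi in zip(q, y)]
        saved.append((a, rho, s, y))
    # Scale by an initial inverse-Hessian guess taken from the newest pair.
    if s_history:
        s, y = s_history[-1], y_history[-1]
        gamma = dot(s, y) / dot(y, y)
        q = [gamma * qi for qi in q]
    # Second loop: walk back from oldest to newest.
    for a, rho, s, y in reversed(saved):
        b = rho * dot(y, q)
        q = [qi + (a - b) * si for qi, si in zip(q, s)]
    return [-qi for qi in q]

s, y = [1.0, 2.0], [2.0, 1.0]
direction = lbfgs_direction(y, [s], [y])
print(direction)  # the approximation maps the gradient change y back to the step s
```

With an empty history the function falls back to plain steepest descent, and each stored pair adds curvature information without ever forming a matrix.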

Advantages:

Memory efficient compared to full BFGS
Faster convergence than first-order methods for smooth objectives
Uses second-order information without explicitly computing the Hessian
Well-suited for small to medium-sized problems

Suitable Scenarios:

Training small to medium-sized models
Problems with smooth objective functions
When you need faster convergence than first-order methods
For batch optimization where you can compute the full gradient
When memory is limited but you still want second-order optimization benefits

LBFGS Parameters Detailed Explanation:

lr: Learning rate (default: 1.0). Unlike other optimizers, LBFGS uses this as a step size multiplier during line search.
max_iter: Maximum number of iterations per optimization step (default: 20). Controls how many times the algorithm will attempt to find an optimal step.
max_eval: Maximum number of function evaluations per optimization step (default: None). If None, defaults to max_iter * 1.25.
tolerance_grad: Gradient tolerance for convergence (default: 1e-05). The optimization stops when the gradient norm falls below this threshold.
tolerance_change: Parameter change tolerance for convergence (default: 1e-09). The optimization stops when parameter updates fall below this threshold.
history_size: Update history size (default: 100). Controls how much memory is used to approximate the Hessian matrix.
line_search_fn: Line search function (default: None). Custom line search function to use instead of the default implementation.

Basic LBFGS

import riemann as rm
import riemann.optim as optim

# Create LBFGS optimizer
optimizer = optim.LBFGS(model.parameters(), lr=1.0)

# Define closure function for LBFGS
def closure():
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    return loss

# Training loop
for epoch in range(num_epochs):
    loss = optimizer.step(closure)
    print(f'Epoch {epoch}, Loss: {loss.item():.4f}')

LBFGS with Custom Parameters

You can customize LBFGS parameters to better suit your specific optimization problem:

import riemann as rm
import riemann.optim as optim

# Create LBFGS optimizer with custom parameters
optimizer = optim.LBFGS(
    model.parameters(),
    lr=1.0,               # Learning rate (step size multiplier)
    max_iter=50,          # Maximum iterations per step
    max_eval=None,        # Maximum function evaluations (None = max_iter * 1.25)
    tolerance_grad=1e-06,  # Gradient convergence threshold
    tolerance_change=1e-09, # Parameter change convergence threshold
    history_size=100,      # Hessian approximation memory
    line_search_fn=None    # Use default line search
)

# Define closure function
def closure():
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    return loss

# Training loop
for epoch in range(num_epochs):
    loss = optimizer.step(closure)
    print(f'Epoch {epoch}, Loss: {loss.item():.4f}')

Parameter Tuning Tips for LBFGS

lr: Start with 1.0 and adjust based on convergence. For ill-conditioned problems, smaller values (0.1-0.5) may be more stable.
max_iter: Increase for more complex optimization problems, but be aware of increased computation time.
history_size: Larger values may improve convergence but increase memory usage. For small models, 100-200 is typically sufficient.
tolerance_grad and tolerance_change: Adjust based on desired precision. Smaller values lead to more precise optimization but may require more iterations.
line_search_fn: Only customize if you have specific line search requirements. The default implementation works well for most problems.

AdamW (Adam with Weight Decay)

AdamW is a variant of Adam that decouples weight decay from the gradient update: the decay is applied directly to the parameters instead of being added to the gradients. Because the decay term then bypasses the adaptive per-parameter scaling, it regularizes all weights uniformly and avoids the interaction between weight decay and adaptive learning rates present in Adam.

Algorithm Explanation:

Compute first moment estimate: m = β1*m + (1-β1)*∇θL(θ)
Compute second moment estimate: v = β2*v + (1-β2)*(∇θL(θ))²
Apply bias correction: m̂ = m/(1-β1^t), v̂ = v/(1-β2^t)
Apply weight decay: θ = θ * (1 - η * weight_decay)
Update parameters: θ = θ - η*m̂/(√v̂ + ε)
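
The difference between decoupled weight decay and an L2 penalty folded into the gradient can be seen in a scalar sketch. The function `adam_like_step` and its `decoupled` flag are hypothetical names introduced for this comparison, and `eps=1e-8` is an assumed default; this is not Riemann's implementation.

```python
import math

def adam_like_step(theta, grad, m, v, t, lr, weight_decay, decoupled,
                   beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style step with either decoupled decay or an L2 gradient term."""
    if not decoupled:
        grad = grad + weight_decay * theta         # L2 penalty folded into the gradient
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    if decoupled:
        theta = theta * (1.0 - lr * weight_decay)  # shrink the weight directly
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# With a zero loss gradient, only the decay term acts on the weight.
adamw_theta, _, _ = adam_like_step(10.0, 0.0, 0.0, 0.0, 1, lr=0.1,
                                   weight_decay=0.01, decoupled=True)
l2_theta, _, _ = adam_like_step(10.0, 0.0, 0.0, 0.0, 1, lr=0.1,
                                weight_decay=0.01, decoupled=False)
print(adamw_theta, l2_theta)
```

In the decoupled case the weight shrinks by the factor (1 - lr * weight_decay). In the L2 case the penalty is rescaled by the adaptive denominator, so the weight moves by about lr no matter how small weight_decay is.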

Advantages:

More effective weight decay as a regularizer
Better generalization performance
Avoids Adam’s weight decay issues
Well-suited for modern deep learning architectures

Suitable Scenarios:

Training deep neural networks
Transfer learning and fine-tuning
Models with many parameters prone to overfitting
Any scenario where Adam is used, especially with weight decay

Basic AdamW

import riemann as rm
import riemann.optim as optim

# Create AdamW optimizer
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)

# Training loop
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()

AdamW with Custom Betas

import riemann as rm
import riemann.optim as optim

# Create AdamW optimizer with custom betas
optimizer = optim.AdamW(model.parameters(), lr=0.001, betas=(0.9, 0.999), weight_decay=0.01)

# Training loop
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()

RMSprop (Root Mean Square Propagation)

RMSprop is an adaptive learning rate optimization algorithm that maintains a moving average of the squared gradients for each parameter. It is particularly well-suited for recurrent neural networks (RNNs) and other sequence models.

Algorithm Explanation:

Compute moving average of squared gradients: v = α*v + (1-α)*(∇θL(θ))²
Update parameters: θ = θ - η*∇θL(θ)/(√v + ε)
Optional: Apply momentum and/or use centered version
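
The two core steps can be sketched for a scalar parameter as follows. The helper name `rmsprop_step` and the default `eps=1e-8` are assumptions; momentum and the centered variant are left out.

```python
import math

def rmsprop_step(theta, grad, v, lr=0.001, alpha=0.99, eps=1e-8):
    """One RMSprop update for a scalar parameter; v is the running mean of squares."""
    v = alpha * v + (1.0 - alpha) * grad * grad
    theta = theta - lr * grad / (math.sqrt(v) + eps)
    return theta, v

theta, v = 0.0, 0.0
theta, v = rmsprop_step(theta, grad=2.0, v=v, lr=0.01)
print(theta, v)
```

Unlike Adam, this form has no bias correction, so with v starting at zero the first update is about lr / √(1 - alpha) in magnitude (ten times lr for alpha=0.99).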

Advantages:

Adaptive learning rate per parameter
Effective for non-stationary objectives
Well-suited for RNNs and sequence models
Normalizes update magnitudes, which can mitigate the effects of vanishing or exploding gradients

Suitable Scenarios:

Training recurrent neural networks (RNNs)
Training long short-term memory (LSTM) networks
Training gated recurrent units (GRUs)
Any model with sequence data
Models where gradients can vary significantly over time

Basic RMSprop

import riemann as rm
import riemann.optim as optim

# Create RMSprop optimizer
optimizer = optim.RMSprop(model.parameters(), lr=0.001, alpha=0.99)

# Training loop
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()

RMSprop with Momentum

import riemann as rm
import riemann.optim as optim

# Create RMSprop optimizer with momentum
optimizer = optim.RMSprop(model.parameters(), lr=0.001, alpha=0.99, momentum=0.9)

# Training loop
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()

RMSprop with Centered Gradients

import riemann as rm
import riemann.optim as optim

# Create RMSprop optimizer with centered gradients
optimizer = optim.RMSprop(model.parameters(), lr=0.001, alpha=0.99, centered=True)

# Training loop
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()

Optimizer Methods

Zero Gradients

import riemann as rm
import riemann.optim as optim

optimizer = optim.SGD(model.parameters(), lr=0.01)

# Zero gradients
optimizer.zero_grad()

Step

import riemann as rm
import riemann.optim as optim

optimizer = optim.SGD(model.parameters(), lr=0.01)

# Perform a single optimization step
optimizer.step()

State Dict

import riemann as rm
import riemann.optim as optim

optimizer = optim.SGD(model.parameters(), lr=0.01)

# Get optimizer state
state_dict = optimizer.state_dict()

# Load optimizer state
optimizer.load_state_dict(state_dict)

Parameter Groups

Parameter groups are a powerful feature of Riemann optimizers that allow you to configure different hyperparameters for different sets of parameters in your model. This is particularly useful in scenarios such as:

Setting different learning rates for different layers of your model
Applying different hyperparameters to weight and bias parameters
Using different learning rates when fine-tuning pre-trained models

Basic Structure of Parameter Groups

Parameter groups are defined through a list of dictionaries, where each dictionary contains:

params: The set of parameters to optimize
Other key-value pairs: Hyperparameters specific to this parameter group (e.g., lr, weight_decay)

Basic Usage Example

import riemann as rm
import riemann.nn as nn
import riemann.optim as optim

# Create model
model = nn.Sequential(
    nn.Linear(10, 50),
    nn.ReLU(),
    nn.Linear(50, 1)
)

# Create parameter groups
optimizer = optim.SGD([
    {'params': model[0].parameters(), 'lr': 0.01},
    {'params': model[2].parameters(), 'lr': 0.001}
], momentum=0.9)

# Training loop
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()

In the example above:

The first parameter group contains all parameters from the first layer (model[0]) with a learning rate of 0.01
The second parameter group contains all parameters from the third layer (model[2]) with a learning rate of 0.001
Both parameter groups share the momentum=0.9 hyperparameter

Setting Different Hyperparameters for Weights and Biases

import riemann as rm
import riemann.nn as nn
import riemann.optim as optim

# Create model
model = nn.Linear(10, 1)

# Separate weight and bias parameters
weight_params = [p for name, p in model.named_parameters() if 'weight' in name]
bias_params = [p for name, p in model.named_parameters() if 'bias' in name]

# Create parameter groups
optimizer = optim.SGD([
    {'params': weight_params, 'lr': 0.01, 'weight_decay': 1e-4},
    {'params': bias_params, 'lr': 0.02, 'weight_decay': 0}
])

In this example:

Weight parameters use a smaller learning rate (0.01) and weight decay (1e-4)
Bias parameters use a larger learning rate (0.02) and no weight decay

Using Parameter Groups in Pre-trained Models

import riemann as rm
import riemann.nn as nn
import riemann.optim as optim

# Create a pre-trained model (simplified example)
class PretrainedModel(nn.Module):
    def __init__(self):
        super(PretrainedModel, self).__init__()
        # Assume this part is a pre-trained feature extractor
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1))
        )
        # Assume this part is a newly added classifier
        self.classifier = nn.Linear(64, 10)

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        x = self.classifier(x)
        return x

model = PretrainedModel()

# Create parameter groups
optimizer = optim.SGD([
    {'params': model.features.parameters(), 'lr': 0.001},  # Lower LR for pre-trained features
    {'params': model.classifier.parameters(), 'lr': 0.01}  # Higher LR for new classifier
], momentum=0.9)

In this example:

The pre-trained feature extractor uses a smaller learning rate (0.001) to avoid disrupting learned features
The newly added classifier uses a larger learning rate (0.01) to speed up its convergence

How Parameter Groups Work

When you create an optimizer with parameter groups:

The optimizer maintains separate states for each parameter group
During each step() call, the optimizer updates parameters according to the hyperparameters of their respective groups
If a hyperparameter is not specified in a parameter group, the optimizer uses the default value provided in the constructor
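
The fallback to constructor defaults can be sketched with ordinary dictionaries. The helper `build_param_groups` is hypothetical and only illustrates the merging rule described above; strings stand in for real parameter tensors.

```python
def build_param_groups(params, defaults):
    """Normalize `params` into a list of group dicts with defaults filled in."""
    params = list(params)
    if not params or not isinstance(params[0], dict):
        params = [{'params': params}]   # a flat iterable becomes a single group
    groups = []
    for group in params:
        merged = dict(defaults)         # start from the constructor defaults
        merged.update(group)            # per-group settings take priority
        merged['params'] = list(merged['params'])
        groups.append(merged)
    return groups

groups = build_param_groups(
    [{'params': ['w1', 'b1'], 'lr': 0.01}, {'params': ['w2', 'b2']}],
    defaults={'lr': 0.001, 'momentum': 0.9},
)
print(groups)
```

The first group keeps its own lr of 0.01, the second inherits lr=0.001, and both pick up momentum=0.9 from the defaults.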

Best Practices

Clear Naming: Use named_parameters() to create parameter groups based on parameter names for better code readability
Logical Grouping: Group parameters based on their nature and importance, such as parameters from different layers, weight and bias parameters, and pre-trained and newly added parameters
Learning Rate Scheduling: Parameter group learning rates can be used with learning rate schedulers, which will adjust them based on their initial learning rates
Hyperparameter Search: Parameter groups enable more flexible hyperparameter searches by allowing different configurations for different parts of your model

Gradient Clipping

Gradient clipping is a technique to prevent exploding gradients in deep networks by limiting the size of gradients to ensure training stability. During the training of deep neural networks, gradients can become extremely large, leading to excessive parameter updates and divergence of the training process.

Benefits of Gradient Clipping

Prevents Exploding Gradients: Limits the maximum gradient value, avoiding excessive parameter updates
Improves Training Stability: Makes the training process more stable and reduces training fluctuations
Speeds Up Convergence: Can help the model converge faster in some cases
Allows Larger Learning Rates: By limiting gradients, larger initial learning rates can be used

Clip by Norm

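Clipping by norm computes the norm of all gradients taken together (the L2 norm by default) and, if it exceeds max_norm, scales every gradient by the same factor so that the total norm equals max_norm. This method preserves the direction of the gradients while adjusting their magnitude.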

Parameter Description:

parameters: The parameter collection whose gradients to clip
max_norm: The maximum norm of the gradients
norm_type: The type of norm to use, default is 2 (L2 norm)
error_if_nonfinite: Whether to throw an error if gradients contain non-finite values (like NaN or inf), default is False

Usage Example:

import riemann as rm
import riemann.nn as nn

# Clip gradients by norm
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

Use Cases:

Suitable for most deep neural network training
Especially useful when using RNNs or LSTMs
When you observe NaN or inf in training loss

Practical Application Example:

import riemann as rm
import riemann.nn as nn
import riemann.optim as optim

# Create model
model = nn.Sequential(
    nn.Linear(100, 50),
    nn.ReLU(),
    nn.Linear(50, 10)
)

# Create optimizer
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Define loss function
loss_fn = nn.CrossEntropyLoss()

# Training loop
for epoch in range(num_epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(data)
        loss = loss_fn(output, target)
        loss.backward()

        # Clip gradients before optimizer step
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

        optimizer.step()
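
What clipping by norm does to the gradients can be sketched on plain Python lists. The helper `clip_by_norm` is hypothetical, and the small constant 1e-6 added to the denominator is an assumption for numerical safety, not a documented Riemann detail.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale gradient vectors so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / (total_norm + 1e-6))  # never scale gradients up
    return [[g * scale for g in vec] for vec in grads], total_norm

clipped, norm_before = clip_by_norm([[3.0], [4.0]], max_norm=1.0)
print(norm_before, clipped)
```

Both gradients are multiplied by the same factor, so their ratio (and therefore the update direction) is unchanged while the overall length drops to about max_norm.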

Clip by Value

Clipping by value works by limiting each element of the gradients to a specified range. This method directly truncates extreme values in the gradients.

Parameter Description:

parameters: The parameter collection whose gradients to clip
clip_value: The maximum absolute value of gradients
error_if_nonfinite: Whether to throw an error if gradients contain non-finite values (like NaN or inf), default is False

Usage Example:

import riemann as rm
import riemann.nn as nn

# Clip gradients by value
nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)

Use Cases:

When you want to directly control the maximum absolute value of gradients
When there are extreme outliers in the gradients
For specific network architectures like discriminators in GANs

Practical Application Example:

import riemann as rm
import riemann.nn as nn
import riemann.optim as optim

# Create model
model = nn.Sequential(
    nn.Linear(100, 50),
    nn.ReLU(),
    nn.Linear(50, 10)
)

# Create optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Define loss function
loss_fn = nn.CrossEntropyLoss()

# Training loop
for epoch in range(num_epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(data)
        loss = loss_fn(output, target)
        loss.backward()

        # Clip gradients before optimizer step
        nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)

        optimizer.step()
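
Element-wise clipping, as described in this section, can be sketched in a few lines of plain Python. The helper name `clip_by_value` is hypothetical.

```python
def clip_by_value(grads, clip_value):
    """Clamp every gradient element to the range [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, g)) for g in grads]

clipped = clip_by_value([0.2, -3.0, 0.7], clip_value=0.5)
print(clipped)  # [0.2, -0.5, 0.5]
```

Elements inside the range are untouched while outliers are truncated, so unlike clipping by norm this can change the direction of the overall gradient.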

Best Practices for Gradient Clipping

Choose the Right Clipping Method:

Clip by norm (clip_grad_norm_): Suitable for most cases, especially RNNs
Clip by value (clip_grad_value_): Suitable for cases with extreme gradient values

Set Reasonable Clipping Thresholds:

Clip by norm: max_norm is typically set between 0.5 and 5.0
Clip by value: clip_value is typically set between 0.1 and 1.0

Clipping Timing:

Must be executed after loss.backward() and before optimizer.step()
Should be applied to every batch

Combining with Other Techniques:

Use with learning rate schedulers
Use with parameter groups to apply different clipping strategies to different layers

Monitoring Effectiveness:

Observe if training loss becomes more stable
Check if exploding gradients still occur
Adjust clipping thresholds for optimal results

Examples

Training a Neural Network with Adam

import riemann as rm
import riemann.nn as nn
import riemann.optim as optim

# Create model
model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10)
)

# Create optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)

# Define loss function
loss_fn = nn.CrossEntropyLoss()

# Training loop
for epoch in range(50):
    total_loss = 0
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(data)
        loss = loss_fn(output, target)
        loss.backward()

        # Gradient clipping
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

        optimizer.step()
        total_loss += loss.item()

    print(f'Epoch {epoch}, Loss: {total_loss/len(train_loader):.4f}')

Training with Different Learning Rates for Different Layers

import riemann as rm
import riemann.nn as nn
import riemann.optim as optim

# Create a pre-trained model (simplified example)
class PretrainedModel(nn.Module):
    def __init__(self):
        super(PretrainedModel, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1))
        )
        self.classifier = nn.Linear(64, 10)

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        x = self.classifier(x)
        return x

model = PretrainedModel()

# Create parameter groups with different learning rates
optimizer = optim.SGD([
    {'params': model.features.parameters(), 'lr': 0.001},  # Lower LR for pretrained features
    {'params': model.classifier.parameters(), 'lr': 0.01}  # Higher LR for new classifier
], momentum=0.9)

# Training loop
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()

Custom Optimizer

You can create custom optimizers by inheriting from optim.Optimizer.

import riemann as rm
import riemann.optim as optim

class CustomSGD(optim.Optimizer):
    def __init__(self, params, lr=0.01, momentum=0):
        defaults = dict(lr=lr, momentum=momentum)
        super(CustomSGD, self).__init__(params, defaults)

    def step(self, closure=None):
        loss = None
        if closure is not None:
            loss = closure()

        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue

                d_p = p.grad.data
                if group['momentum'] != 0:
                    param_state = self.state[p]
                    if 'momentum_buffer' not in param_state:
                        buf = param_state['momentum_buffer'] = rm.clone(d_p).detach()
                    else:
                        buf = param_state['momentum_buffer']
                        buf.mul_(group['momentum']).add_(d_p)
                        d_p = buf

                p.data.add_(-group['lr'], d_p)

        return loss

# Use custom optimizer
optimizer = CustomSGD(model.parameters(), lr=0.01, momentum=0.9)

# Training loop
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()

Learning Rate Schedulers

Learning rate schedulers are used to dynamically adjust the learning rate during training, which is crucial for model convergence and performance optimization. Riemann provides several learning rate schedulers, each with its specific adjustment strategy.

Types of Learning Rate Schedulers

StepLR

Function: Multiplies the learning rate by gamma every step_size epochs.

Parameters:

optimizer: The optimizer whose learning rate is to be adjusted
step_size: The number of epochs between successive learning rate decays
gamma: The learning rate decay factor, default is 0.1
last_epoch: The index of the last epoch, default is -1

Use Case: Suitable for scenarios where learning rate needs to be reduced at fixed intervals.
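
The resulting schedule can be written in closed form. The sketch below assumes the common convention that the rate is multiplied by gamma once every step_size epochs, counting epochs from 0; the function name `step_lr` is hypothetical.

```python
def step_lr(base_lr, epoch, step_size, gamma=0.1):
    """Learning rate at a given epoch under a step decay schedule."""
    return base_lr * gamma ** (epoch // step_size)

schedule = [step_lr(0.1, e, step_size=30) for e in (0, 29, 30, 60)]
print(schedule)  # stays at 0.1 until epoch 30, then drops tenfold every 30 epochs
```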

MultiStepLR

Function: Multiplies the learning rate by gamma each time the epoch count reaches one of the specified milestones.

Parameters:

optimizer: The optimizer whose learning rate is to be adjusted
milestones: A list of epoch indices at which the learning rate is decayed
gamma: The learning rate decay factor, default is 0.1
last_epoch: The index of the last epoch, default is -1

Use Case: Suitable for scenarios where learning rate needs to be reduced at specific epochs.
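
A closed-form sketch of this schedule counts how many milestones have been reached and applies gamma that many times. The function name `multi_step_lr` is hypothetical, and the convention that a decay takes effect at the milestone epoch itself is an assumption.

```python
from bisect import bisect_right

def multi_step_lr(base_lr, epoch, milestones, gamma=0.1):
    """Learning rate at a given epoch with decays at the listed milestones."""
    reached = bisect_right(sorted(milestones), epoch)  # milestones <= epoch
    return base_lr * gamma ** reached

rates = [multi_step_lr(0.1, e, milestones=[30, 60, 90]) for e in (0, 29, 30, 75, 90)]
print(rates)
```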

ExponentialLR

Function: Multiplies the learning rate by gamma every epoch, producing an exponential decay.

Parameters:

optimizer: The optimizer whose learning rate is to be adjusted
gamma: The learning rate decay factor
last_epoch: The index of the last epoch, default is -1

Use Case: Suitable for scenarios where a continuous smooth decay of learning rate is needed.
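
Exponential decay has a simple closed form, sketched below. The function name `exponential_lr` is hypothetical, and epochs are assumed to count from 0.

```python
def exponential_lr(base_lr, epoch, gamma):
    """Learning rate after `epoch` multiplications by gamma."""
    return base_lr * gamma ** epoch

rates = [exponential_lr(0.1, e, gamma=0.99) for e in (0, 1, 50, 100)]
print(rates)
```

With gamma=0.99 the rate falls to a little over a third of its starting value after 100 epochs, so values of gamma close to 1 are typical for long training runs.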

CosineAnnealingLR

Function: Adjusts the learning rate according to the shape of a cosine function.

Parameters:

optimizer: The optimizer whose learning rate is to be adjusted
T_max: The number of epochs over which the learning rate anneals from its initial value to eta_min (half a cosine period)
eta_min: The minimum learning rate, default is 0
last_epoch: The index of the last epoch, default is -1

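Use Case: Suitable for scenarios where the learning rate should decay smoothly toward eta_min over T_max epochs. If training continues past T_max, the rate rises again along the cosine curve, which can help escape local optima.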
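
The cosine shape is commonly expressed in the closed form below. That Riemann follows exactly this expression is an assumption, and the function name `cosine_annealing_lr` is hypothetical.

```python
import math

def cosine_annealing_lr(base_lr, epoch, T_max, eta_min=0.0):
    """Learning rate on a cosine curve from base_lr (epoch 0) to eta_min (epoch T_max)."""
    cosine = (1.0 + math.cos(math.pi * epoch / T_max)) / 2.0  # goes from 1 down to 0
    return eta_min + (base_lr - eta_min) * cosine

start, middle, end = (cosine_annealing_lr(0.1, e, T_max=50, eta_min=0.001)
                      for e in (0, 25, 50))
print(start, middle, end)
```

The rate starts at the initial value, passes the midpoint between the two bounds halfway through, and reaches eta_min at epoch T_max. Evaluating the same expression beyond T_max makes the rate climb back toward its initial value.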

ReduceLROnPlateau

Function: Adjusts the learning rate when a metric stops improving.

Parameters:

optimizer: The optimizer whose learning rate is to be adjusted
mode: 'min' (reduce the learning rate when the metric stops decreasing) or 'max' (reduce it when the metric stops increasing), default is 'min'
factor: The learning rate decay factor, default is 0.1
patience: Number of epochs with no improvement after which learning rate will be reduced, default is 10
threshold: Threshold for measuring new best, default is 1e-4
threshold_mode: Threshold mode, 'rel' or 'abs', default is 'rel'
cooldown: Number of epochs to wait before resuming normal operation after lr has been reduced, default is 0
min_lr: Minimum learning rate, default is 0
eps: Minimum decay applied to the learning rate; if the difference between the new and old learning rate is smaller than eps, the update is ignored, default is 1e-8

Use Case: Suitable for scenarios where learning rate needs to be dynamically adjusted based on validation metrics.
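
The interplay of patience, threshold, factor, and min_lr can be sketched with a small tracker class. This is a simplified illustration under stated assumptions, not Riemann's implementation: the class name `PlateauTracker` is hypothetical, only 'min' mode with a relative threshold is handled, and cooldown and eps are omitted.

```python
class PlateauTracker:
    """Minimal 'min'-mode plateau logic with a relative threshold."""

    def __init__(self, lr, factor=0.1, patience=10, threshold=1e-4, min_lr=0.0):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.threshold, self.min_lr = threshold, min_lr
        self.best = float('inf')
        self.bad_epochs = 0

    def step(self, metric):
        # An epoch counts as an improvement only if it beats the best by the threshold.
        if metric < self.best * (1.0 - self.threshold):
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        # Reduce the rate once more than `patience` epochs pass without improvement.
        if self.bad_epochs > self.patience:
            self.lr = max(self.lr * self.factor, self.min_lr)
            self.bad_epochs = 0
        return self.lr

tracker = PlateauTracker(lr=0.1, patience=2)
history = [tracker.step(loss) for loss in (1.0, 0.9, 0.9, 0.9, 0.9)]
print(history)
```

The validation loss stalls at 0.9 for three epochs, which exceeds the patience of 2, so the learning rate is cut by the factor on the final step.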

Using Learning Rate Schedulers

The basic usage flow of learning rate schedulers is as follows:

Create an optimizer
Create a learning rate scheduler, passing in the optimizer and relevant parameters
In the training loop, first call the optimizer’s step() method to update parameters
Then call the scheduler’s step() method to update the learning rate

Scheduler and Optimizer Interaction

Order: It is recommended to call optimizer.step() first, then scheduler.step()
Parameter Groups: The scheduler will adjust each parameter group based on its initial learning rate
State Saving: The scheduler’s state can be saved and loaded using state_dict() and load_state_dict() methods
Special Case: The ReduceLROnPlateau scheduler requires a validation metric to be passed in the step() method

Complete Example Code

import riemann as rm
import riemann.nn as nn
import riemann.optim as optim
from riemann.optim.lr_scheduler import StepLR, MultiStepLR, ExponentialLR, CosineAnnealingLR, ReduceLROnPlateau

# Create model
model = nn.Sequential(
    nn.Linear(10, 50),
    nn.ReLU(),
    nn.Linear(50, 1)
)

# Create optimizer
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Create learning rate scheduler (choose one)
# 1. StepLR example
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)

# 2. MultiStepLR example
# scheduler = MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)

# 3. ExponentialLR example
# scheduler = ExponentialLR(optimizer, gamma=0.99)

# 4. CosineAnnealingLR example
# scheduler = CosineAnnealingLR(optimizer, T_max=50, eta_min=0.001)

# 5. ReduceLROnPlateau example (needs validation loss in step())
# scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=10)

# Define loss function
loss_fn = nn.MSELoss()

# Generate example data
inputs = rm.randn(100, 10)
targets = rm.randn(100, 1)

# Training loop
num_epochs = 100
for epoch in range(num_epochs):
    # Forward pass
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)

    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Update learning rate
    scheduler.step()  # For ReduceLROnPlateau, use scheduler.step(loss.item())

    # Print information
    if (epoch + 1) % 10 == 0:
        current_lr = scheduler.get_lr()[0] if hasattr(scheduler, 'get_lr') else optimizer.param_groups[0]['lr']
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}, LR: {current_lr:.6f}')

Best Practices for Learning Rate Schedulers

Choose the Right Scheduler: Select the appropriate learning rate scheduling strategy based on task characteristics
Set a Reasonable Initial Learning Rate: Too large an initial learning rate can cause training to diverge, while too small a value slows convergence
Combine with Validation Set: When using ReduceLROnPlateau, it should be based on validation metrics rather than training metrics
Learning Rate Lower Bound: Set a reasonable min_lr for ReduceLROnPlateau to prevent training stagnation due to extremely small learning rates
Warm-up Phase: For large models, consider using a smaller learning rate for warm-up in the early training stages
Parameter Group Compatibility: When used with parameter groups, ensure that each parameter group has a reasonable initial learning rate