How to Build a Neural Network

Riemann provides a comprehensive set of neural network modules through the riemann.nn package. These modules are building blocks for creating and training neural networks.

This section provides a step-by-step guide on how to build, train, and evaluate a complete neural network using Riemann. We will use MNIST handwritten digit recognition as an example to demonstrate the entire process from data preparation to model evaluation.

Step 1: Data Preparation

Before building a neural network, you need to prepare your dataset. Riemann provides the Dataset and DataLoader interfaces for data loading and processing.

Understanding Dataset

Dataset is an abstract base class used to represent a dataset. It defines two core methods that subclasses must implement:

__len__(): Returns the number of samples in the dataset
__getitem__(idx): Returns a sample based on the index

Why use Dataset?

The Dataset abstraction provides a unified interface for accessing data, allowing the training loop to handle different data sources (images, text, audio, etc.) in the same way. It also enables lazy loading, where data is only loaded into memory when needed.

Using Built-in Datasets vs. Custom Datasets

Riemann provides built-in datasets for common tasks. For computer vision tasks, you can use datasets from riemann.vision.datasets:

from riemann.vision.datasets import MNIST

# Use Riemann's built-in MNIST dataset
train_dataset = MNIST(root='./data', train=True, transform=transform)

If you need to use your own data, you can create a custom dataset by inheriting from Dataset:

from riemann.utils.data import Dataset
import riemann as rm

class MyCustomDataset(Dataset):
    """Example of a custom dataset for structured data"""

    def __init__(self, data_path):
        # Load your data here
        self.data = rm.load(data_path)  # Example: load from file
        self.labels = rm.load_labels(data_path)

    def __len__(self):
        """Return the total number of samples"""
        return len(self.data)

    def __getitem__(self, idx):
        """Return a single sample (data, label)"""
        return self.data[idx], self.labels[idx]

In this tutorial, we use Riemann’s built-in MNIST dataset, which automatically downloads and manages the data for us.

Data Transformation with Transforms

transforms is used for data preprocessing and augmentation. You can compose multiple transformations using transforms.Compose:

from riemann.vision import transforms

# Define data transformations
transform = transforms.Compose([
    transforms.ToTensor(),           # Convert image to tensor
    transforms.Normalize((0.1307,), (0.3081,))  # Normalize with mean and std
])

Key Concepts:

ToTensor(): Converts PIL Image or numpy array to tensor and scales pixel values from [0, 255] to [0.0, 1.0]. This normalization is important because neural networks work best with small input values (typically 0-1 or -1 to 1).
Normalize(mean, std): Normalizes tensor with mean and standard deviation: output = (input - mean) / std. For MNIST, the values (0.1307, 0.3081) are pre-computed statistics of the dataset. Normalization helps the network learn faster by ensuring all features are on a similar scale.

Loading MNIST Dataset

Data Root Directory Management

To simplify data storage location management, Riemann provides the get_data_root() utility function to get the project’s data root directory:

from riemann.utils import get_data_root

# Get the data root directory path
data_root = get_data_root()
print(f"Data root: {data_root}")
# Example output: D:\\code\\Riemann\\data

This function automatically locates the data folder under the project root directory, avoiding the need to manually specify paths in different environments.

Loading the Dataset

from riemann.utils import get_data_root

# Load training and test datasets
train_dataset = MNIST(
    root=get_data_root(),  # Use utility function to get data root
    train=True,            # True for training set, False for test set
    transform=transform    # Data transformation to apply
)

test_dataset = MNIST(
    root=get_data_root(),
    train=False,
    transform=transform
)

print(f"Training set size: {len(train_dataset)}")
print(f"Test set size: {len(test_dataset)}")

Using DataLoader for Batch Processing

DataLoader is used for batch loading of data, supporting data shuffling and automatic batching:

from riemann.utils.data import DataLoader

# Create DataLoader for training
train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=100,     # Number of samples per batch
    shuffle=True        # Shuffle data at every epoch
)

# Create DataLoader for testing
test_loader = DataLoader(
    dataset=test_dataset,
    batch_size=1,       # Process one sample at a time for testing
    shuffle=False       # No need to shuffle test data
)

Why use DataLoader and batch training?

Training on batches (mini-batches) instead of single samples offers several advantages:

Computational Efficiency: Processing multiple samples together allows for better utilization of hardware (CPU/GPU) through vectorized operations.
Memory Efficiency: You don’t need to load the entire dataset into memory at once. DataLoader loads batches on-demand.
More Stable Gradients: Gradients computed from a batch are less noisy than those from a single sample, leading to more stable training.
Generalization: Batch training with shuffling helps the model generalize better by seeing data in different orders each epoch.

Key Parameters:

dataset: The dataset to load data from
batch_size: How many samples per batch to load. Common values are 32, 64, 128, or 256. Larger batches give more stable gradients but require more memory.
shuffle: Set to True to have the data reshuffled at every epoch. This prevents the model from learning the order of data and improves generalization.

Step 2: Building the Neural Network

Neural networks in Riemann are built by inheriting from nn.Module and implementing the forward method.

Understanding nn.Module

nn.Module is the base class for all neural network modules. It provides:

Parameter Management: Automatically tracks learnable parameters (weights and biases)
Submodule Management: Supports nested modules, allowing complex architectures
Device Management: Supports CPU/GPU execution with simple .to('cuda') calls
Training/Evaluation Modes: train() and eval() methods control behaviors like dropout

Defining the Network Architecture

For MNIST classification, we’ll build a simple feedforward neural network:

import riemann.nn as nn
import riemann.optim as opt

class Classifier(nn.Module):
    """
    MNIST Handwritten Digit Classifier

    Network Architecture:
    - Input Layer: 784 neurons (28x28 pixels flattened)
    - Hidden Layer: 200 neurons with ReLU activation
    - Output Layer: 10 neurons (for digits 0-9)
    """
    def __init__(self):
        super().__init__()

        # Define network layers using Sequential container
        self.model = nn.Sequential(
            nn.Flatten(),           # Flatten (1, 28, 28) to (1, 784)
            nn.Linear(784, 200),    # Input to hidden layer
            nn.ReLU(),              # Activation function
            nn.Linear(200, 10)      # Hidden to output layer
        )

        # Define loss function for multi-class classification
        self.loss_func = nn.CrossEntropyLoss()

        # Define optimizer with Adam algorithm
        self.optimizer = opt.Adam(
            self.parameters(),      # Parameters to optimize
            lr=0.001,               # Learning rate
            betas=(0.9, 0.999),     # Coefficients for running averages
            weight_decay=0.0001     # L2 regularization
        )

    def forward(self, inputs):
        """
        Forward pass

        Args:
            inputs: Tensor of shape (batch_size, 1, 28, 28)

        Returns:
            Tensor of shape (batch_size, 10) - unnormalized logits
        """
        return self.model(inputs)

Understanding Each Component:

nn.Sequential: A container that executes modules in sequence. It’s like a pipeline where data flows from the first layer to the last. This simplifies the forward pass definition.
nn.Flatten: Flattens the input tensor from (batch, 1, 28, 28) to (batch, 784). MNIST images are 28x28 pixels, but neural networks expect a 1D vector as input. Flatten reshapes the data without changing its values.
nn.Linear: A fully connected (dense) layer that applies the transformation y = xW^T + b, where W is the weight matrix and b is the bias vector. - nn.Linear(784, 200) means 784 inputs (flattened image) → 200 outputs (hidden neurons) - The network learns the optimal weights during training
nn.ReLU (Activation Function): Rectified Linear Unit applies f(x) = max(0, x). Activation functions introduce non-linearity, allowing the network to learn complex patterns. Without activation functions, multiple linear layers would collapse into a single linear transformation.
nn.CrossEntropyLoss: The loss function measures how wrong the model’s predictions are. Cross-entropy is ideal for multi-class classification because: - It penalizes confident wrong predictions heavily - It works directly with raw model outputs (logits), no need for softmax - It combines LogSoftmax and Negative Log-Likelihood for numerical stability
Adam Optimizer: Adam (Adaptive Moment Estimation) is an optimization algorithm that: - Adapts the learning rate for each parameter individually - Uses momentum to accelerate convergence - Combines the benefits of AdaGrad and RMSProp - The learning rate (lr=0.001) controls how big each update step is

Step 3: Training the Network

Training involves iterating over the dataset multiple times (epochs), computing predictions, calculating loss, and updating parameters.

How Parameters Learn: The Core Mechanism

Neural networks learn by iteratively adjusting their parameters (weights and biases) to minimize the loss function. Here’s how it works:

Forward Pass: Input data flows through the network, producing predictions
Loss Calculation: Compare predictions with true labels to compute error
Backward Pass (Backpropagation): Calculate gradients - how much each parameter contributed to the error
Parameter Update: Adjust parameters in the direction that reduces loss

This process repeats for thousands or millions of iterations until the model converges.

Training Step Implementation

class Classifier(nn.Module):
    # ... __init__ and forward methods from above ...

    def train_step(self, inputs, targets):
        """
        Execute one training step

        Args:
            inputs: Batch of images, shape (batch_size, 1, 28, 28)
            targets: Batch of labels, shape (batch_size,)

        Returns:
            loss: Scalar loss value
        """
        # Forward pass: compute predictions
        outputs = self.forward(inputs)

        # Compute loss
        loss = self.loss_func(outputs, targets)

        # Backward pass: compute gradients
        self.optimizer.zero_grad(True)  # Clear previous gradients
        loss.backward()                  # Compute gradients

        # Update parameters
        self.optimizer.step()

        return loss

Detailed Explanation of Each Step:

Forward Pass: The input batch (100 images) goes through the network: - Flatten: (100, 1, 28, 28) → (100, 784) - Linear + ReLU: (100, 784) → (100, 200) - Linear: (100, 200) → (100, 10) - raw scores for each digit
Loss Computation: CrossEntropyLoss compares the predicted scores with the true labels. It produces a single number representing “how wrong” the predictions are.
zero_grad(True): Clears gradients from the previous iteration. Gradients accumulate by default, so we must clear them before computing new ones.
backward(): Computes gradients of the loss with respect to all parameters using the chain rule of calculus. This tells us which direction to adjust each parameter to reduce loss.
optimizer.step(): Updates all parameters using the computed gradients. The optimizer applies the learning rate and any momentum/adaptive scaling.

Complete Training Loop

# Create model instance
model = Classifier()

# Training configuration
epochs = 3

# Training loop
for epoch in range(epochs):
    model.train()  # Set model to training mode
    epoch_loss = 0.0
    num_batches = len(train_loader)

    # Iterate over batches
    for batch_idx, batch in enumerate(train_loader):
        img_tensors, target_tensors = batch

        # Execute training step
        loss = model.train_step(img_tensors, target_tensors)
        epoch_loss += loss.item()

        # Print progress every 100 batches
        if batch_idx % 100 == 0:
            print(f'Epoch {epoch+1}/{epochs}, '
                  f'Batch {batch_idx}/{num_batches}, '
                  f'Loss: {loss.item():.4f}')

    # Calculate average loss for the epoch
    avg_loss = epoch_loss / num_batches
    print(f'Epoch {epoch+1}/{epochs} completed, Average Loss: {avg_loss:.4f}')

Training Process Explained:

An epoch is one complete pass through the entire training dataset. With 60,000 training images and a batch size of 100, we have 600 batches per epoch.

The loss value should decrease over time: - High loss (2.0+) = model is guessing randomly - Medium loss (0.5-1.0) = model is learning but still makes mistakes - Low loss (0.1-0.3) = model is confident and mostly correct

Step 4: Evaluation and Inference

After training, evaluate the model on the test set to measure its generalization performance.

Understanding Accuracy and Model Performance

What is Accuracy?

Accuracy is the percentage of correctly classified samples out of all samples. For MNIST with 10,000 test images, if the model correctly classifies 9,500, the accuracy is 95%.

Factors Affecting Accuracy:

Model Architecture: More layers/neurons can learn complex patterns but may overfit
Training Duration: Too few epochs = underfitting; too many = overfitting
Learning Rate: Too high = unstable training; too low = slow convergence
Data Quality: Clean, well-labeled data produces better models
Regularization: Techniques like weight_decay prevent overfitting
Data Augmentation: Transformations during training improve generalization

Overfitting vs. Underfitting:

Underfitting: Training accuracy is low. The model is too simple or hasn’t trained long enough.
Overfitting: Training accuracy is high but test accuracy is low. The model memorized training data instead of learning general patterns.
Good Fit: Both training and test accuracy are high and close to each other.

Evaluation Method

class Classifier(nn.Module):
    # ... previous methods ...

    def evaluate(self, dataloader):
        """
        Evaluate model performance

        Args:
            dataloader: DataLoader providing test data

        Returns:
            accuracy: Classification accuracy (0-1)
            avg_loss: Average loss over the dataset
        """
        total_loss = 0
        correct = 0
        total = 0

        for batch in dataloader:
            img_tensors, target_tensors = batch

            # Forward pass
            outputs = self.forward(img_tensors)

            # Compute loss
            loss = self.loss_func(outputs, target_tensors)
            total_loss += loss.item()

            # Compute accuracy
            predicted = outputs.argmax(dim=1)  # Get predicted class
            total += target_tensors.size(0)
            correct += (predicted == target_tensors).sum().item()

        accuracy = correct / total
        avg_loss = total_loss / len(dataloader)
        return accuracy, avg_loss

How Accuracy is Calculated:

outputs.argmax(dim=1): For each sample, find the index of the highest score. This index is the predicted digit (0-9).
Compare with targets: Check if predicted == target_tensors to get a boolean tensor of correct/incorrect predictions.
Sum and divide: Count correct predictions and divide by total samples to get accuracy.

Running Evaluation

# Set model to evaluation mode
model.eval()

# Evaluate on test set
test_accuracy, test_loss = model.evaluate(test_loader)
print(f'Test Accuracy: {test_accuracy:.4f}')
print(f'Test Loss: {test_loss:.4f}')

Key Points:

model.eval(): Sets the model to evaluation mode. This disables dropout (if used) and batch normalization updates. It’s crucial for consistent evaluation results.
Test vs. Training Performance: Test accuracy is usually slightly lower than training accuracy. A small gap (1-3%) is normal. A large gap indicates overfitting.
Loss vs. Accuracy: Loss measures confidence; accuracy measures correctness. A model can have high accuracy but high loss if it’s uncertain about correct predictions, or low accuracy but low loss if it’s confidently wrong.

Step 5: Complete Example

Here is the complete runnable code for MNIST handwritten digit recognition:

import sys
import os
import time

# Import Riemann modules
import riemann.nn as nn
import riemann.optim as opt
from riemann.vision.datasets import MNIST
from riemann.vision import transforms
from riemann.utils.data import DataLoader


class Classifier(nn.Module):
    """MNIST Handwritten Digit Classifier"""

    def __init__(self):
        super().__init__()

        # Network architecture
        self.model = nn.Sequential(
            nn.Flatten(),
            nn.Linear(784, 200),
            nn.ReLU(),
            nn.Linear(200, 10)
        )

        # Loss function and optimizer
        self.loss_func = nn.CrossEntropyLoss()
        self.optimizer = opt.Adam(
            self.parameters(),
            lr=0.001,
            betas=(0.9, 0.999),
            weight_decay=0.0001
        )

    def forward(self, inputs):
        return self.model(inputs)

    def train_step(self, inputs, targets):
        outputs = self.forward(inputs)
        loss = self.loss_func(outputs, targets)
        self.optimizer.zero_grad(True)
        loss.backward()
        self.optimizer.step()
        return loss

    def evaluate(self, dataloader):
        total_loss = 0
        correct = 0
        total = 0

        for batch in dataloader:
            img_tensors, target_tensors = batch
            outputs = self.forward(img_tensors)

            loss = self.loss_func(outputs, target_tensors)
            total_loss += loss.item()

            predicted = outputs.argmax(dim=1)
            total += target_tensors.size(0)
            correct += (predicted == target_tensors).sum().item()

        accuracy = correct / total
        avg_loss = total_loss / len(dataloader)
        return accuracy, avg_loss


def main():
    print("MNIST Handwritten Digit Recognition")

    # Step 1: Data preparation
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])

    print("Loading datasets...")
    train_dataset = MNIST(root='./data', train=True, transform=transform)
    test_dataset = MNIST(root='./data', train=False, transform=transform)

    train_loader = DataLoader(dataset=train_dataset, batch_size=100, shuffle=True)
    test_loader = DataLoader(dataset=test_dataset, batch_size=1, shuffle=False)

    print(f"Training set size: {len(train_dataset)}")
    print(f"Test set size: {len(test_dataset)}")

    # Step 2: Create model
    print("\nInitializing model...")
    model = Classifier()

    # Step 3: Training
    print("\nStarting training...")
    epochs = 3
    train_start_time = time.time()

    for epoch in range(epochs):
        model.train()
        epoch_loss = 0.0
        num_batches = len(train_loader)

        for batch_idx, batch in enumerate(train_loader):
            img_tensors, target_tensors = batch
            loss = model.train_step(img_tensors, target_tensors)
            epoch_loss += loss.item()

            if batch_idx % 100 == 0:
                print(f'Epoch {epoch+1}/{epochs}, '
                      f'Batch {batch_idx}/{num_batches}, '
                      f'Loss: {loss.item():.4f}')

        avg_loss = epoch_loss / num_batches
        print(f'Epoch {epoch+1}/{epochs} completed, '
              f'Average Loss: {avg_loss:.4f}')

        # Step 4: Evaluation
        model.eval()
        test_accuracy, test_loss = model.evaluate(test_loader)
        print(f'Test Accuracy: {test_accuracy:.4f}, '
              f'Test Loss: {test_loss:.4f}')
        print('-' * 50)

    train_end_time = time.time()
    print(f"Total training time: {train_end_time - train_start_time:.2f} seconds")


if __name__ == "__main__":
    main()

Expected Output

When you run the complete example, you should see output similar to:

MNIST Handwritten Digit Recognition
Loading datasets...
Training set size: 60000
Test set size: 10000

Initializing model...

Starting training...
Epoch 1/3, Batch 0/600, Loss: 2.3124
Epoch 1/3, Batch 100/600, Loss: 0.5231
Epoch 1/3, Batch 200/600, Loss: 0.3412
Epoch 1/3, Batch 300/600, Loss: 0.2894
Epoch 1/3, Batch 400/600, Loss: 0.2543
Epoch 1/3, Batch 500/600, Loss: 0.1987
Epoch 1/3 completed, Average Loss: 0.3124
Test Accuracy: 0.9123, Test Loss: 0.2987
--------------------------------------------------
Epoch 2/3, Batch 0/600, Loss: 0.1876
Epoch 2/3, Batch 100/600, Loss: 0.1654
...
Test Accuracy: 0.9456, Test Loss: 0.1876
--------------------------------------------------
Epoch 3/3 completed
Test Accuracy: 0.9567, Test Loss: 0.1456
--------------------------------------------------
Total training time: 45.23 seconds

Interpreting the Results:

Epoch 1: Loss decreases from ~2.3 to ~0.3, accuracy ~91%. The model is learning basic patterns.
Epoch 2: Loss ~0.18, accuracy ~94%. The model is refining its understanding.
Epoch 3: Loss ~0.14, accuracy ~95%. The model has converged to a good solution.

A final accuracy of 95-97% is excellent for this simple network. More complex architectures (CNNs) can achieve 99%+.

Key Concepts Summary

Dataset and DataLoader

Dataset: Abstract base class for data representation, requires __len__ and __getitem__. Use built-in datasets for common tasks or create custom datasets for your own data.
DataLoader: Handles batching, shuffling, and loading data efficiently. Batch training improves computational efficiency and gradient stability.
Transforms: Preprocessing pipeline for data augmentation and normalization. Essential for preparing data for neural network training.

Neural Network Components

nn.Module: Base class for all neural network modules. Manages parameters and provides training infrastructure.
nn.Sequential: Container for stacking layers sequentially. Simplifies forward pass definition.
nn.Flatten: Reshapes multi-dimensional input (images) into 1D vectors for fully connected layers.
nn.Linear: Fully connected layer that learns linear transformations. Core building block of neural networks.
nn.ReLU: Activation function introducing non-linearity. Enables learning of complex patterns.
nn.CrossEntropyLoss: Loss function for multi-class classification. Measures prediction error.
Optimizer (Adam): Algorithm for updating parameters based on gradients. Adam adapts learning rates per parameter.

Training Process

Forward Pass: Compute model predictions by propagating input through the network.
Loss Calculation: Measure difference between predictions and targets using loss function.
Backward Pass: Compute gradients via backpropagation to determine how to adjust parameters.
Parameter Update: Optimizer adjusts parameters using gradients and learning rate.
Epoch: One complete pass through the training dataset. Multiple epochs are needed for convergence.

Evaluation

model.eval(): Set model to evaluation mode (disables dropout, etc.).
argmax: Get predicted class from output logits by selecting the highest score.
Accuracy: Percentage of correct predictions. Test accuracy measures generalization to unseen data.
Overfitting: When training accuracy is much higher than test accuracy. Use regularization to prevent it.

Module Class and Containers

All neural network modules in Riemann inherit from the nn.Module class, which is the foundation for building neural networks. This section details the core functionality, parameter management, and usage methods of various container classes.

Module Class Core Functionality

The nn.Module class provides the following core functionalities:

Parameter Management: Automatically tracks and manages learnable parameters
Submodule Management: Supports nested submodules, forming hierarchical structures
Device Management: Supports moving modules to different devices (CPU/GPU)
Forward Propagation: Defines the data flow path through the network
State Management: Supports training/evaluation mode switching
Hook Management: Supports registering forward/backward hooks for debugging, feature extraction, and gradient modification

Module Class Main Methods

Module Class Main Methods
Method Name	Description	Usage Example
`__init__()`	Initialize the module, create core data structures	`super(MyModule, self).__init__()`
`forward(args, *kwargs)`	Define forward propagation logic, must be implemented by subclasses	`def forward(self, x): return self.layer(x)`
`__call__(args, *kwargs)`	Module call interface, internally calls forward method	`output = model(input_data)`
`parameters(recurse=True)`	Return iterator over all parameters	`for param in model.parameters(): print(param.shape)`
`named_parameters(prefix='', recurse=True)`	Return iterator over named parameters	`for name, param in model.named_parameters(): print(name, param.shape)`
`buffers(recurse=True)`	Return iterator over all buffers	`for buffer in model.buffers(): print(buffer.shape)`
`named_buffers(prefix='', recurse=True)`	Return iterator over named buffers	`for name, buffer in model.named_buffers(): print(name, buffer.shape)`
`children()`	Return iterator over direct submodules	`for child in model.children(): print(child)`
`modules()`	Return iterator over all submodules (including self)	`for module in model.modules(): print(module)`
`named_modules(prefix='', recurse=True)`	Return iterator over named modules	`for name, module in model.named_modules(): print(name, module)`
`train(mode=True)`	Set module to training mode	`model.train()`
`eval()`	Set module to evaluation mode	`model.eval()`
`to(device)`	Move module to specified device	`model.to('cuda')`
`cuda()`	Move module to CUDA device	`model.cuda()`
`cpu()`	Move module to CPU device	`model.cpu()`
`zero_grad(set_to_none=False)`	Clear gradients of all parameters	`model.zero_grad()`
`requires_grad_(requires_grad=True)`	Set whether parameters require gradients	`model.requires_grad_(False) # Freeze parameters`
`state_dict(destination=None, prefix='', keep_vars=False)`	Return module state dictionary	`state = model.state_dict()`
`load_state_dict(state_dict)`	Load state dictionary into module	`model.load_state_dict(state)`
`register_parameter(name, param)`	Register parameter to module	`self.register_parameter('weight', nn.Parameter(rm.randn(10, 5)))`
`register_buffer(name, tensor)`	Register buffer to module	`self.register_buffer('running_mean', rm.zeros(10))`
`add_module(name, module)`	Explicitly add submodule	`self.add_module('linear', nn.Linear(10, 5))`
`register_forward_pre_hook(hook)`	Register forward pre-hook	`handle = model.register_forward_pre_hook(my_hook)`
`register_forward_hook(hook)`	Register forward post-hook	`handle = model.register_forward_hook(my_hook)`
`register_full_backward_pre_hook(hook)`	Register backward pre-hook	`handle = model.register_full_backward_pre_hook(my_hook)`
`register_full_backward_hook(hook)`	Register backward post-hook	`handle = model.register_full_backward_hook(my_hook)`
`apply(fn)`	Recursively apply function to all submodules	`model.apply(init_weights)`
`get_parameter(target)`	Get parameter by name	`param = model.get_parameter('layer1.weight')`
`get_submodule(target)`	Get submodule by name	`module = model.get_submodule('layer1.conv1')`
`get_buffer(target)`	Get buffer by name	`buffer = model.get_buffer('bn1.running_mean')`
`has_parameter(target)`	Check if parameter exists	`if model.has_parameter('weight'): ...`
`has_buffer(target)`	Check if buffer exists	`if model.has_buffer('running_mean'): ...`
`set_parameter(name, param)`	Set parameter by name	`model.set_parameter('weight', new_param)`
`set_buffer(name, tensor)`	Set buffer by name	`model.set_buffer('running_mean', new_tensor)`
`delete_parameter(target)`	Delete parameter by name	`model.delete_parameter('old_weight')`
`delete_buffer(target)`	Delete buffer by name	`model.delete_buffer('old_buffer')`
`copy()`	Create shallow copy of module	`new_model = model.copy()`
`deepcopy()`	Create deep copy of module	`new_model = model.deepcopy()`

Creating Custom Modules

import riemann as rm
import riemann.nn as nn

class MyNetwork(nn.Module):
    def __init__(self):
        super(MyNetwork, self).__init__()
        # Define submodules
        self.linear1 = nn.Linear(10, 50)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(50, 1)

    def forward(self, x):
        # Define forward propagation logic
        x = self.linear1(x)
        x = self.relu(x)
        x = self.linear2(x)
        return x

# Create instance
model = MyNetwork()
print(model)

Container Classes

Riemann provides several container classes to organize and manage modules:

Sequential

The Sequential container executes modules in sequence, suitable for simple linear network structures:

Parameters:

Accepts module lists or keyword arguments

Usage Example:

import riemann as rm
import riemann.nn as nn

# Method 1: Using module list
model = nn.Sequential(
    nn.Linear(10, 20),
    nn.ReLU(),
    nn.Linear(20, 5)
)

# Method 2: Using keyword arguments
model = nn.Sequential(
    linear1=nn.Linear(10, 20),
    relu=nn.ReLU(),
    linear2=nn.Linear(20, 5)
)

# Forward pass
x = rm.randn(32, 10)
output = model(x)
print(output.shape)  # [32, 5]

ModuleList

The ModuleList container stores module lists, allowing access by index, suitable for scenarios requiring dynamic control of forward propagation:

Parameters:

modules: Module list (optional)

Main Methods:

append(module): Add module
extend(modules): Extend module list
insert(index, module): Insert module

Usage Example:

import riemann as rm
import riemann.nn as nn

# Create module list
layers = nn.ModuleList([
    nn.Linear(10, 20),
    nn.ReLU()
])

# Add more modules
layers.append(nn.Linear(20, 10))
layers.append(nn.ReLU())
layers.append(nn.Linear(10, 5))

# Forward pass
x = rm.randn(32, 10)
for i, layer in enumerate(layers):
    x = layer(x)
    print(f"After layer {i}: {x.shape}")

print(f"Final output shape: {x.shape}")  # [32, 5]

ModuleDict

The ModuleDict container uses a dictionary to store modules, allowing access by key, suitable for scenarios requiring selection of different modules based on conditions:

Parameters:

modules: Module dictionary (optional)

Main Methods:

update(modules): Update module dictionary
pop(key): Remove and return module with specified key

Usage Example:

import riemann as rm
import riemann.nn as nn

# Create module dictionary
layers = nn.ModuleDict({
    'linear1': nn.Linear(10, 20),
    'relu': nn.ReLU(),
    'linear2': nn.Linear(20, 5)
})

# Add new module
layers.update({'dropout': nn.Dropout(p=0.5)})

# Forward pass
x = rm.randn(32, 10)
x = layers['linear1'](x)
x = layers['relu'](x)
x = layers['dropout'](x)
x = layers['linear2'](x)

print(x.shape)  # [32, 5]

ParameterList

The ParameterList container is specifically designed for storing parameter lists:

Usage Example:

import riemann as rm
import riemann.nn as nn

# Create parameter list
params = nn.ParameterList([
    nn.Parameter(rm.randn(10, 20)),
    nn.Parameter(rm.randn(20))
])

# Add more parameters
params.append(nn.Parameter(rm.randn(20, 5)))

# Index access
weight = params[0]
bias = params[1]

ParameterDict

The ParameterDict container is specifically designed for storing parameter dictionaries:

Usage Example:

import riemann as rm
import riemann.nn as nn

# Create parameter dictionary
params = nn.ParameterDict({
    'w1': nn.Parameter(rm.randn(10, 20)),
    'b1': nn.Parameter(rm.randn(20)),
    'w2': nn.Parameter(rm.randn(20, 5)),
    'b2': nn.Parameter(rm.randn(5))
})

# Access by key
weight1 = params['w1']
bias1 = params['b1']

Activation Functions

Activation functions are important components in neural networks, introducing non-linear characteristics that enable networks to learn complex function mappings.

Activation Function List

Activation Functions Supported by Riemann
Function Name	Description	Application Scenarios	Parameter Meanings	Notes
`ReLU`	Rectified Linear Unit, outputs max(0, x)	Default choice for most deep learning models	No parameters	May produce “dying neuron” problem
`LeakyReLU`	ReLU with leak, small slope in negative region	Solving ReLU’s dying neuron problem	`negative_slope`: Slope in negative region, default 0.01	Slightly higher computational cost than ReLU
`RReLU`	Randomized Leaky ReLU, random slope during training	Provides regularization effect	`lower`: Lower bound of slope, default 1/8 `upper`: Upper bound of slope, default 1/3	Random during training, fixed during evaluation
`PReLU`	Parametric ReLU, learnable slope	Scenarios requiring adaptive negative slope	`num_parameters`: Number of parameters `init`: Initial value, default 0.25	Adds few parameters, stronger expressiveness
`Sigmoid`	S-shaped activation function, outputs (0, 1)	Output layer in binary classification tasks	No parameters	Suffers from gradient vanishing problem
`Tanh`	Hyperbolic tangent function, outputs (-1, 1)	RNN and other sequence models	No parameters	Zero-centered, converges faster than Sigmoid
`Softmax`	Normalized exponential function, outputs probability distribution	Output layer in multi-class classification tasks	`dim`: Calculation dimension, default -1	Usually used with cross-entropy loss
`LogSoftmax`	Logarithm of Softmax	Multi-class tasks, used with NLLLoss	`dim`: Calculation dimension, default -1	Better numerical stability
`GELU`	Gaussian Error Linear Unit	Default choice in Transformer models	No parameters	Higher computational cost
`ELU`	Exponential Linear Unit	Scenarios requiring zero-centered output	`alpha`: Negative saturation parameter, default 1.0	Output mean close to zero
`CELU`	Continuously Differentiable Exponential Linear Unit	Scenarios requiring smooth gradients	`alpha`: Formula parameter, default 1.0	Continuously differentiable at x=0
`SELU`	Scaled Exponential Linear Unit	Deep networks, self-normalizing scenarios	No parameters	Self-normalization with proper initialization
`SiLU`	Sigmoid Linear Unit (Swish)	Modern deep networks	No parameters	Smooth non-monotonic, excellent performance
`Softplus`	Softplus function, smooth approximation of ReLU	Scenarios requiring smooth activation	`beta`: Smoothness parameter, default 1.0 `threshold`: Threshold, default 20	Differentiable everywhere, no hard threshold

Loss Functions

Loss functions are used to measure the difference between model predictions and true target values, and are core components of model training.

Loss Function List

Loss Functions Supported by Riemann
Function Name	Description	Application Scenarios	Parameter Meanings	Notes
`MSELoss`	Mean Squared Error loss	Regression tasks	`reduction`: Aggregation method, default ‘mean’	Sensitive to outliers
`L1Loss`	L1 loss (absolute error)	Regression tasks insensitive to outliers	`reduction`: Aggregation method, default ‘mean’	Gradient discontinuous at origin
`SmoothL1Loss`	Smooth L1 loss (Huber loss)	Regression tasks robust to outliers	`beta`: Threshold, default 1.0 `reduction`: Aggregation method, default ‘mean’	Combines advantages of L1 and L2
`CrossEntropyLoss`	Cross entropy loss, combining log_softmax and nll_loss	Multi-class classification tasks	`weight`: Class weights `ignore_index`: Ignored target value `reduction`: Aggregation method, default ‘mean’	Input is raw logits, no need for softmax
`NLLLoss`	Negative Log Likelihood loss	Multi-class tasks, used with LogSoftmax	`weight`: Class weights `ignore_index`: Ignored target value `reduction`: Aggregation method, default ‘mean’	Input should be log probabilities
`BCEWithLogitsLoss`	Binary cross entropy loss with logits	Binary classification tasks	`weight`: Sample weights `pos_weight`: Positive class weight `reduction`: Aggregation method, default ‘mean’	Input is raw logits, no need for sigmoid
`HuberLoss`	Huber loss, robust to outliers	Regression tasks sensitive to outliers	`delta`: Threshold, default 1.0	Moderate computational cost

Initialization Module

Riemann’s initialization module (riemann.nn.init) provides a series of utility functions for initializing neural network parameters, maintaining interface consistency with PyTorch’s nn.init module. Proper parameter initialization is crucial for neural network training, helping models converge faster and achieve better performance.

Main Features:

Basic Initialization: Uniform distribution, normal distribution, constant, zeros, identity matrix, etc.
Advanced Initialization: Xavier (Glorot) initialization, Kaiming (He) initialization, orthogonal initialization, etc.
Gain Calculation: Calculate recommended gain values based on activation functions

Usage:

import riemann as rm
from riemann import nn

# Create a tensor
w = rm.empty(3, 5)

# Use Xavier uniform initialization
nn.init.xavier_uniform_(w)

# Use Kaiming normal initialization (for ReLU)
nn.init.kaiming_normal_(w, mode='fan_in', nonlinearity='relu')

Initialization Functions List

Initialization Functions Supported by Riemann
Function Name	Description	Usage Scenarios	Parameters
`uniform_`	Uniform distribution initialization	General weight initialization	`a`: Lower bound, default 0.0 `b`: Upper bound, default 1.0
`normal_`	Normal distribution initialization	General weight initialization	`mean`: Mean, default 0.0 `std`: Standard deviation, default 1.0
`trunc_normal_`	Truncated normal distribution initialization	Initialization with value range constraints	`mean`: Mean, default 0.0 `std`: Standard deviation, default 1.0 `a`: Lower bound, default -2.0 `b`: Upper bound, default 2.0
`constant_`	Constant initialization	Bias initialization	`val`: Fill value
`ones_`	All-ones initialization	Specific layer initialization	No parameters
`zeros_`	All-zeros initialization	Initialize bias to zero	No parameters
`eye_`	Identity matrix initialization	Preserve input features in linear layers	No parameters (2D tensors only)
`dirac_`	Dirac delta initialization	Preserve input channels in convolutional layers	`groups`: Number of groups, default 1
`xavier_uniform_`	Xavier uniform initialization	Symmetric activation functions (Sigmoid/Tanh)	`gain`: Gain factor, default 1.0
`xavier_normal_`	Xavier normal initialization	Symmetric activation functions	`gain`: Gain factor, default 1.0
`kaiming_uniform_`	Kaiming uniform initialization	ReLU and its variants	`a`: Negative slope, default 0 `mode`: ‘fan_in’ or ‘fan_out’ `nonlinearity`: Activation function name
`kaiming_normal_`	Kaiming normal initialization	ReLU and its variants	Same as `kaiming_uniform_`
`orthogonal_`	Orthogonal initialization	RNN and sequence models	`gain`: Gain factor, default 1.0
`sparse_`	Sparse initialization	Scenarios requiring sparse weights	`sparsity`: Sparsity ratio `std`: Standard deviation, default 0.01
`calculate_gain`	Calculate gain value	Compute scaling factor for custom initialization	`nonlinearity`: Activation function name `param`: Optional parameter

Gain Value Reference Table

The calculate_gain function returns recommended gain values based on activation functions:

Gain Values for Different Activation Functions
Activation Function	Gain Value	Description
Linear / Identity	1	Linear transformation
Conv{1,2,3}D	1	Convolutional layers
Sigmoid	1	S-shaped activation
Tanh	5/3	Hyperbolic tangent
ReLU	sqrt(2)	Rectified Linear Unit
Leaky ReLU	sqrt(2/(1+negative_slope^2))	Leaky ReLU
SELU	3/4	Self-Normalizing ELU

Usage Examples

Example 1: Xavier Initialization for Linear Layer

import riemann as rm
from riemann import nn

# Create linear layer
linear = nn.Linear(784, 256)

# Xavier uniform initialization (for Tanh/Sigmoid)
nn.init.xavier_uniform_(linear.weight)
nn.init.zeros_(linear.bias)

Example 2: Kaiming Initialization for Convolutional Layer

import riemann as rm
from riemann import nn

# Create convolutional layer
conv = nn.Conv2d(3, 64, kernel_size=3)

# Kaiming normal initialization (for ReLU)
nn.init.kaiming_normal_(conv.weight, mode='fan_out', nonlinearity='relu')
nn.init.zeros_(conv.bias)

Example 3: Custom Initialization

import riemann as rm
from riemann import nn
import math

# Create tensor
w = rm.empty(256, 128)

# Calculate gain value
gain = nn.init.calculate_gain('leaky_relu', 0.2)

# Use calculated gain for Xavier initialization
nn.init.xavier_uniform_(w, gain=gain)

Basic Network Layers

Linear Layer (Linear)

Linear layer (also known as fully connected layer) performs affine transformation on input data. It is one of the most fundamental layers in neural networks.

Purpose:

Implements linear transformation: output = input @ weight.T + bias
Commonly used for feature transformation, final classification layer, and dimension conversion in networks
Basic building block for constructing Multi-Layer Perceptrons (MLP)

Parameters:

in_features: Input feature dimension
out_features: Output feature dimension
bias: Whether to use bias, default True

Usage Example:

import riemann as rm
import riemann.nn as nn

# Create linear layer
linear = nn.Linear(in_features=20, out_features=10)

# Forward pass
x = rm.randn(32, 20)
output = linear(x)
print(output.shape)  # [32, 10]

Dropout Layer

Dropout layer prevents overfitting by randomly deactivating neurons. It is a commonly used regularization technique.

Purpose:

Prevents neural network overfitting and improves model generalization
Randomly sets some neuron outputs to zero during training, forcing the network to learn more robust feature representations
Commonly used after fully connected layers, especially in deep networks

Parameters:

p: Dropout probability, default 0.5, representing the probability of each neuron being dropped

Usage Example:

import riemann as rm
import riemann.nn as nn

# Create dropout layer
dropout = nn.Dropout(p=0.5)

# Forward pass (training mode)
x = rm.randn(4, 16)
dropout.train()
output_train = dropout(x)

# Forward pass (evaluation mode)
dropout.eval()
output_eval = dropout(x)

Dropout2d Layer

Dropout2d layer randomly drops entire feature maps at the channel level, suitable for convolutional neural networks.

Purpose:

Specifically designed for regularization of 2D convolutional feature maps (shape (N, C, H, W))
Drops entire channels randomly rather than individual pixels, preserving spatial correlation of feature maps
Commonly used after convolutional layers to prevent overfitting in CNNs

Parameters:

p: Dropout probability, default 0.5

Usage Example:

import riemann as rm
import riemann.nn as nn

# Create Dropout2d layer
dropout2d = nn.Dropout2d(p=0.5)

# Forward pass (input shape [N, C, H, W])
x = rm.randn(4, 16, 32, 32)
dropout2d.train()
output = dropout2d(x)
print(output.shape)  # [4, 16, 32, 32]

Dropout3d Layer

Dropout3d layer randomly drops entire 3D feature maps at the channel level, suitable for 3D convolutional neural networks.

Purpose:

Specifically designed for regularization of 3D convolutional feature maps (shape (N, C, D, H, W))
Drops entire 3D feature volumes at the channel level
Commonly used in 3D convolutional networks for video processing, 3D medical imaging, etc.

Parameters:

p: Dropout probability, default 0.5

Usage Example:

import riemann as rm
import riemann.nn as nn

# Create Dropout3d layer
dropout3d = nn.Dropout3d(p=0.5)

# Forward pass (input shape [N, C, D, H, W])
x = rm.randn(4, 16, 8, 32, 32)
dropout3d.train()
output = dropout3d(x)
print(output.shape)  # [4, 16, 8, 32, 32]

Flatten Layer

Flatten layer flattens the input tensor within a specified dimension range.

Purpose:

Flattens multi-dimensional tensors into 1D or lower-dimensional tensors, commonly used to connect convolutional and fully connected layers
Preserves batch dimension while merging spatial and channel dimensions into feature vectors
Bridge between convolutional and fully connected parts in CNN architectures

Parameters:

start_dim: Starting dimension for flattening, default 1
end_dim: Ending dimension for flattening, default -1

Usage Example:

import riemann as rm
import riemann.nn as nn

flatten = nn.Flatten()

# Flatten (batch, 1, 28, 28) to (batch, 784)
x = rm.randn(32, 1, 28, 28)
output = flatten(x)
print(output.shape)  # [32, 784]

BatchNorm1d Layer

1D batch normalization layer, normalizes the channel dimension for 2D or 3D inputs.

Purpose:

Accelerates neural network training convergence and allows larger learning rates
Reduces sensitivity to initialization and improves training stability
Provides some regularization effect, reducing dependence on Dropout
Commonly used after fully connected layers or 1D convolutional layers

Parameters:

num_features: Number of features (channel count C)
eps: Small constant for numerical stability, default 1e-5
momentum: Momentum for running statistics, default 0.1
affine: Whether to use learnable affine parameters, default True
track_running_stats: Whether to track running mean and variance, default True

Usage Example:

import riemann as rm
import riemann.nn as nn

# Create BatchNorm1d layer
bn = nn.BatchNorm1d(num_features=100)

# 2D input (N, C)
x = rm.randn(20, 100)
output = bn(x)
print(output.shape)  # [20, 100]

# 3D input (N, C, L)
x = rm.randn(20, 100, 35)
output = bn(x)
print(output.shape)  # [20, 100, 35]

BatchNorm2d Layer

2D batch normalization layer, normalizes the channel dimension for 4D inputs (N, C, H, W).

Purpose:

Specifically designed for 2D convolutional neural networks, normalizes each channel’s feature map
Accelerates CNN training and improves model generalization
Key component for building modern CNNs (e.g., ResNet, DenseNet)
Usually placed after convolutional layers and before activation functions

Parameters:

num_features: Number of features (channel count C)
eps: Small constant for numerical stability, default 1e-5
momentum: Momentum for running statistics, default 0.1
affine: Whether to use learnable affine parameters, default True
track_running_stats: Whether to track running mean and variance, default True

Usage Example:

import riemann as rm
import riemann.nn as nn

# Create BatchNorm2d layer
bn = nn.BatchNorm2d(num_features=64)

# 4D input (N, C, H, W)
x = rm.randn(16, 64, 32, 32)
output = bn(x)
print(output.shape)  # [16, 64, 32, 32]

BatchNorm3d Layer

3D batch normalization layer, normalizes the channel dimension for 5D inputs (N, C, D, H, W).

Purpose:

Specifically designed for 3D convolutional neural networks, such as video processing and 3D medical image analysis
Normalizes 3D feature volumes for each channel
Important component of 3D CNN architectures (e.g., C3D, I3D)

Parameters:

num_features: Number of features (channel count C)
eps: Small constant for numerical stability, default 1e-5
momentum: Momentum for running statistics, default 0.1
affine: Whether to use learnable affine parameters, default True
track_running_stats: Whether to track running mean and variance, default True

Usage Example:

import riemann as rm
import riemann.nn as nn

# Create BatchNorm3d layer
bn = nn.BatchNorm3d(num_features=32)

# 5D input (N, C, D, H, W)
x = rm.randn(8, 32, 4, 16, 16)
output = bn(x)
print(output.shape)  # [8, 32, 4, 16, 16]

LayerNorm Layer

Layer normalization layer, normalizes all features of a single sample.

Purpose:

Normalizes features of individual samples without relying on batch statistics
Suitable for scenarios with batch size of 1 or dynamically changing batch sizes
Core component of Transformer models, used as an alternative to BatchNorm
Widely used in natural language processing tasks

Parameters:

normalized_shape: Dimensions to normalize, can be an integer or tuple
eps: Small constant for numerical stability, default 1e-5
affine: Whether to use learnable affine parameters, default True

Usage Example:

import riemann as rm
import riemann.nn as nn

# Create LayerNorm layer
ln = nn.LayerNorm(normalized_shape=128)

# Input can be any shape, last dimension must match normalized_shape
x = rm.randn(20, 128)
output = ln(x)
print(output.shape)  # [20, 128]

# Multi-dimensional input
x = rm.randn(20, 10, 128)
output = ln(x)
print(output.shape)  # [20, 10, 128]

Embedding Layer

Embedding layer, converts integer indices to fixed-size dense vector representations.

Purpose:

Maps discrete integer indices (such as word indices) to continuous vector representations
Basic component for processing categorical features and sequential data (such as text, user IDs)
Used as word embedding layer in NLP tasks
Supports padding index (padding_idx) not participating in gradient computation

Parameters:

num_embeddings: Number of embedding vectors (vocabulary size)
embedding_dim: Dimension of each embedding vector
padding_idx: Padding index, embedding vectors at this index do not participate in gradient computation, default None
max_norm: Maximum norm of embedding vectors, re-normalized if exceeded, default None
norm_type: p-value for norm calculation, default 2 (L2 norm)
scale_grad_by_freq: Whether to scale gradients by frequency, default False

Usage Example:

import riemann as rm
import riemann.nn as nn

# Create Embedding layer, vocabulary size 10000, embedding dimension 128
embedding = nn.Embedding(num_embeddings=10000, embedding_dim=128)

# Input is integer indices
input_indices = rm.tensor([1, 5, 10, 100])
output = embedding(input_indices)
print(output.shape)  # [4, 128]

# Using padding_idx
embedding_with_pad = nn.Embedding(10000, 128, padding_idx=0)
input_with_pad = rm.tensor([0, 1, 2, 0])  # 0 is the padding index
output = embedding_with_pad(input_with_pad)

Module Hook Management

Riemann provides a powerful module hook mechanism that allows users to insert custom logic during the forward and backward propagation of modules. The hook mechanism is a powerful tool for debugging, monitoring, and modifying network behavior.

Hook Types Overview

Riemann supports four types of module hooks, executed at different stages of forward and backward propagation:

Hook Type	Registration Method	Execution Timing	Modifiable Value
Forward Pre-Hook	`register_forward_pre_hook`	Called before `forward` method execution	Module input (`input`)
Forward Hook	`register_forward_hook`	Called after `forward` method execution	Module output (`output`)
Full Backward Pre-Hook	`register_full_backward_pre_hook`	Called when all module outputs requiring gradients have received gradients	Output gradients (`grad_output`)
Full Backward Hook	`register_full_backward_hook`	Called when all module inputs requiring gradients have received gradients	Input gradients (`grad_input`)

Hook Execution Order

Hook execution order during forward propagation:

register_forward_pre_hook → forward → register_forward_hook

Hook execution order during backward propagation:

register_full_backward_pre_hook → (compute grad_input) → register_full_backward_hook

Forward Pre-Hook (register_forward_pre_hook)

Purpose:

Modify or inspect input data before the module’s forward computation
Implement input preprocessing, data validation, or debug information printing
Commonly used for dynamically adjusting input ranges, adding noise, or recording intermediate states

Hook Function Signature:

hook(module, input) -> None or modified input

Parameters:

module: The module instance being called
input: A tuple containing all input tensors (even a single input is wrapped in a tuple)

Return Value:

None: Indicates no modification to input, continue execution with original input
Tensor or tuple: Returns modified input, which will replace the original input passed to forward

Usage Example:

import riemann as rm
import riemann.nn as nn

# Define forward pre-hook: print input information
def print_input_hook(module, input):
    print(f"Input shape for module {module._get_name()}: {input[0].shape}")
    return None  # Do not modify input

# Define forward pre-hook: modify input
def double_input_hook(module, input):
    # Multiply input by 2
    return (input[0] * 2,)

# Create linear layer and register hooks
linear = nn.Linear(10, 5)
handle1 = linear.register_forward_pre_hook(print_input_hook)
handle2 = linear.register_forward_pre_hook(double_input_hook)

# Forward propagation
x = rm.ones(2, 10)
output = linear(x)  # Actually uses x * 2

# Remove hooks
handle1.remove()
handle2.remove()

Forward Hook (register_forward_hook)

Purpose:

Modify or inspect output data after the module’s forward computation
Implement feature extraction, output monitoring, and debugging
Commonly used for recording intermediate layer features and analyzing activation distributions

Hook Function Signature:

hook(module, input, output) -> None or modified output

Parameters:

module: The module instance being called
input: A tuple containing all input tensors passed to forward
- Always a tuple: Even for single-input modules, input is a tuple with one element: (input_tensor,)
- Multi-input modules: (input1, input2, ...)
- Note: If a forward pre-hook modified the input, this will be the modified version, not the original input
output: The return value of forward method
- Single-output modules: A single tensor
- Multi-output modules: A tuple of tensors (output1, output2, ...)

Return Value:

None: Indicates no modification to output, use original output as module return value
Tensor or tuple: Returns modified output, which will replace the original output
- For single-output modules, return a tensor
- For multi-output modules, return a tuple with the same structure

Usage Example:

import riemann as rm
import riemann.nn as nn

# Define forward hook: feature extractor
class FeatureExtractor:
    def __init__(self):
        self.features = []

    def hook(self, module, input, output):
        self.features.append(output.clone())
        return None

# Create model and register feature extraction hook
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
)

extractor = FeatureExtractor()
handle = model[0].register_forward_hook(extractor.hook)

# Forward propagation
x = rm.randn(4, 784)
output = model(x)

# View extracted features
print(f"First layer output shape: {extractor.features[0].shape}")

# Remove hook
handle.remove()

Full Backward Pre-Hook (register_full_backward_pre_hook)

Purpose:

Modify or inspect output gradients (grad_output) at the start of backward propagation
Implement gradient clipping, gradient scaling, or gradient monitoring
Commonly used for preventing gradient explosion and adjusting gradient flow

Hook Function Signature:

hook(module, grad_output) -> None or modified grad_output

Parameters:

module: The module instance in backward propagation
grad_output: A tuple containing all output gradients
- Single-output module: (grad_output_tensor,)
- Multi-output module: (grad_output1, grad_output2, ...)
- For outputs that don’t require gradients, the corresponding position is None

Return Value:

None: Indicates no modification to gradients, continue computation with original grad_output
tuple: Returns modified grad_output, which will be used for subsequent gradient computation

Important: If you only want to modify some gradients, the returned tuple must contain all output gradients. For positions you don’t want to modify, return the original gradient value; if you return None for a position, that gradient will be zeroed (set to 0)

Usage Example:

import riemann as rm
import riemann.nn as nn

# Define backward pre-hook: gradient clipping
def clip_grad_hook(module, grad_output):
    # Clip gradients to prevent explosion
    clipped = tuple(
        g.clip(-1, 1) if g is not None else None
        for g in grad_output
    )
    return clipped

# Define backward pre-hook: print gradient information
def print_grad_hook(module, grad_output):
    print(f"Output gradient shape: {grad_output[0].shape}")
    print(f"Output gradient value range: [{grad_output[0].min()}, {grad_output[0].max()}]")
    return None

# Create linear layer and register hook
linear = nn.Linear(10, 5)
handle = linear.register_full_backward_pre_hook(clip_grad_hook)

# Forward and backward propagation
x = rm.randn(2, 10)
x.requires_grad = True
output = linear(x)
output.sum().backward()  # Gradients will be clipped to [-1, 1] range

# Remove hook
handle.remove()

Full Backward Hook (register_full_backward_hook)

Purpose:

Modify or inspect input gradients (grad_input) at the end of backward propagation
Implement gradient monitoring, debugging, and visualization
Commonly used for analyzing gradient flow and detecting vanishing or exploding gradients

Hook Function Signature:

hook(module, grad_input, grad_output) -> None or modified grad_input

Parameters:

module: The module instance in backward propagation
grad_input: A tuple containing all input gradients
- Single-input module: (grad_input_tensor,)
- Multi-input module: (grad_input1, grad_input2, ...)
- For inputs that don’t require gradients, the corresponding position is None
grad_output: A tuple containing all output gradients
- Note: If a backward pre-hook modified the gradients, this will be the modified version

Return Value:

None: Indicates no modification to gradients, continue propagation with original grad_input
tuple: Returns modified grad_input, which will replace the original gradients propagated to the previous layer

Important: If you only want to modify some gradients, the returned tuple must contain all input gradients. For positions you don’t want to modify, return the original gradient value; if you return None for a position, that gradient will be zeroed (set to 0)
Note

This behavior differs from PyTorch. In PyTorch, returning None for a position keeps the gradient as None. Riemann chooses to zero out the gradient for the following reasons:
1. Semantic Consistency: Consistent with backward pre-hook behavior (returning None means zeroing)
2. Practicality: Zeroing is an intuitive way to block gradient propagation, while None requires extra handling
3. Safety: A gradient of 0 is a valid numeric value that won’t cause errors in subsequent computations

Usage Example:

import riemann as rm
import riemann.nn as nn

# Define backward hook: gradient monitor
class GradientMonitor:
    def __init__(self):
        self.gradients = []

    def hook(self, module, grad_input, grad_output):
        self.gradients.append({
            'module': module._get_name(),
            'grad_input': [g.clone() if g is not None else None for g in grad_input],
            'grad_output': [g.clone() if g is not None else None for g in grad_output]
        })
        return None

# Create model and register gradient monitoring hooks
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
)

monitor = GradientMonitor()
for layer in model:
    layer.register_full_backward_hook(monitor.hook)

# Forward and backward propagation
x = rm.randn(4, 784)
x.requires_grad = True
output = model(x)
output.sum().backward()

# View recorded gradient information
for grad_info in monitor.gradients:
    print(f"Module: {grad_info['module']}")
    print(f"Input gradient shapes: {[g.shape if g is not None else None for g in grad_info['grad_input']]}")

Hook Registration and Removal

Registering Hooks:

All hook registration methods return a RemovableHandle object that can be used to remove the hook later:

# Register hook and get handle
handle = module.register_forward_hook(hook_function)

# Remove hook using handle
handle.remove()

Using Context Managers:

RemovableHandle supports the context manager protocol, allowing automatic management of hook lifecycle using with statements:

with module.register_forward_hook(hook_function) as handle:
    # Hook is active within this scope
    output = module(input)
    # Hook is automatically removed when exiting the with block

Managing Multiple Hooks:

A module can register multiple hooks of the same type, which are executed in registration order:

def hook1(module, input):
    print("Hook 1")
    return None

def hook2(module, input):
    print("Hook 2")
    return None

module.register_forward_pre_hook(hook1)
module.register_forward_pre_hook(hook2)

# Execution order: hook1 -> hook2

Typical Application Scenarios

1. Feature Visualization

Feature visualization is a common technique in deep learning to understand what patterns a neural network learns at different layers. By registering forward hooks on convolutional layers, you can capture and visualize intermediate feature maps.

Use Cases:

Visualizing what features different convolutional filters detect (edges, textures, shapes)
Debugging model behavior by inspecting intermediate representations
Creating feature maps for research or presentation purposes

Example: Capturing and visualizing feature maps from a CNN (using real MNIST data)

import riemann.nn as nn
from riemann.vision.datasets import EasyMNIST
from riemann.utils import get_data_root
import matplotlib.pyplot as plt

# Load MNIST dataset
print("Loading MNIST dataset...")
train_dataset = EasyMNIST(root=get_data_root(), train=True, onehot_label=False)

# Get a sample (handwritten digit image)
sample_data, sample_label = train_dataset[0]
print(f"Sample label: {int(sample_label)}")

# Reshape flattened data back to 28x28 image
sample_image = sample_data.reshape(28, 28)

# Create a simple CNN for demonstration
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)
        self.relu = nn.ReLU()
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.fc = nn.Linear(32 * 28 * 28, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = self.relu(x)
        x = self.conv2(x)
        x = x.view(x.size(0), -1)
        return self.fc(x)

# Dictionary to store activations
activations = {}

def get_activation(name):
    """Create a hook function that saves activations"""
    def hook(module, input, output):
        # Detach to avoid saving computation graph
        activations[name] = output.detach()
    return hook

# Create model and register hooks
model = SimpleCNN()
model.conv1.register_forward_hook(get_activation('conv1'))
model.conv2.register_forward_hook(get_activation('conv2'))

# Forward pass with MNIST sample
# Reshape to [batch_size, channels, height, width]
input_image = sample_image.unsqueeze(0).unsqueeze(0)  # [1, 1, 28, 28]
output = model(input_image)

# Now activations['conv1'] contains the feature maps from conv1 layer
# Shape: [1, 16, 28, 28] - 16 feature maps of size 28x28
print(f"Conv1 activations shape: {activations['conv1'].shape}")
print(f"Model prediction: {output.argmax(dim=1).item()}")

# Visualize the first 8 feature maps from conv1
fig, axes = plt.subplots(2, 4, figsize=(12, 6))
for i, ax in enumerate(axes.flat):
    ax.imshow(activations['conv1'][0, i].numpy(), cmap='viridis')
    ax.set_title(f'Filter {i}')
    ax.axis('off')
plt.suptitle('Conv1 Layer Feature Maps', fontsize=14)
plt.tight_layout()
plt.show()

2. Gradient Checking

Gradient checking is essential for debugging training issues. Invalid gradients (NaN or Inf values) can cause training to fail silently or produce unexpected results. By using backward hooks, you can monitor gradients in real-time during training.

Use Cases:

Detecting gradient explosion or vanishing gradients early
Identifying which layers produce invalid gradients
Automatically stopping training or adjusting learning rate when issues occur

Example: Comprehensive gradient monitoring with automatic training stop

import riemann as rm
import riemann.nn as nn

class GradientChecker:
    """A comprehensive gradient checker that monitors for various issues"""

    def __init__(self, threshold=1e3):
        self.threshold = threshold  # Threshold for gradient explosion
        self.has_nan_inf = False
        self.layer_stats = {}

    def hook(self, module, grad_input, grad_output):
        module_name = module._get_name()

        # Check for NaN or Inf in grad_output
        for i, g in enumerate(grad_output):
            if g is not None:
                if rm.isnan(g).any():
                    print(f"ERROR: NaN detected in {module_name} grad_output[{i}]")
                    self.has_nan_inf = True
                if rm.isinf(g).any():
                    print(f"ERROR: Inf detected in {module_name} grad_output[{i}]")
                    self.has_nan_inf = True

                # Check for gradient explosion
                grad_norm = g.norm().item()
                if grad_norm > self.threshold:
                    print(f"WARNING: Gradient explosion in {module_name}: norm={grad_norm:.2f}")

        # Check grad_input as well
        for i, g in enumerate(grad_input):
            if g is not None:
                if rm.isnan(g).any() or rm.isinf(g).any():
                    print(f"ERROR: Invalid gradient in {module_name} grad_input[{i}]")
                    self.has_nan_inf = True

        # Store statistics
        self.layer_stats[module_name] = {
            'grad_output_norms': [g.norm().item() if g is not None else 0 for g in grad_output],
            'grad_input_norms': [g.norm().item() if g is not None else 0 for g in grad_input]
        }

        return None  # Don't modify gradients, just monitor

# Usage in training
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
)

checker = GradientChecker(threshold=100.0)

# Register hooks on all layers
for layer in model:
    layer.register_full_backward_hook(checker.hook)

# Training loop with gradient checking
for epoch in range(10):
    # ... forward pass ...
    # loss = criterion(output, target)
    # loss.backward()

    # Check if gradients are valid before optimizer step
    if checker.has_nan_inf:
        print(f"Epoch {epoch}: Stopping due to invalid gradients")
        break

    # optimizer.step()

3. Weight Statistics Monitoring

Monitoring weight statistics during training helps understand how the network is learning. Sudden changes in weight distribution can indicate issues like poor initialization, learning rate problems, or overfitting.

Use Cases:

Tracking weight distribution changes over training epochs
Detecting dead neurons (weights stuck at zero)
Identifying potential overfitting (weights growing too large)
Validating proper weight initialization

Example: Comprehensive weight and activation monitoring

import riemann as rm
import riemann.nn as nn

class TrainingMonitor:
    """Monitor weights, biases, and activations during training"""

    def __init__(self):
        self.history = []

    def forward_hook(self, module, input, output):
        """Monitor forward pass statistics"""
        stats = {
            'module': module._get_name(),
            'input_mean': input[0].mean().item() if input[0] is not None else 0,
            'output_mean': output.mean().item(),
            'output_std': output.std().item()
        }

        # Monitor weights if available
        if hasattr(module, 'weight') and module.weight is not None:
            w = module.weight.data
            stats.update({
                'weight_mean': w.mean().item(),
                'weight_std': w.std().item(),
                'weight_min': w.min().item(),
                'weight_max': w.max().item(),
                'dead_neurons': (w.abs() < 1e-6).sum().item()  # Near-zero weights
            })

        # Monitor bias if available
        if hasattr(module, 'bias') and module.bias is not None:
            b = module.bias.data
            stats.update({
                'bias_mean': b.mean().item(),
                'bias_std': b.std().item()
            })

        self.history.append(stats)

        # Print warnings for potential issues
        if stats.get('weight_std', 0) > 10:
            print(f"WARNING: {stats['module']} weights have high std: {stats['weight_std']:.2f}")
        if stats.get('dead_neurons', 0) > 0:
            print(f"INFO: {stats['module']} has {stats['dead_neurons']} dead neurons")

    def print_summary(self):
        """Print summary of monitored statistics"""
        print("\n=== Training Monitor Summary ===")
        for stats in self.history[-5:]:  # Show last 5 records
            print(f"\n{stats['module']}:")
            if 'weight_mean' in stats:
                print(f"  Weight: mean={stats['weight_mean']:.4f}, std={stats['weight_std']:.4f}")
            if 'bias_mean' in stats:
                print(f"  Bias: mean={stats['bias_mean']:.4f}, std={stats['bias_std']:.4f}")
            print(f"  Activation: mean={stats['output_mean']:.4f}, std={stats['output_std']:.4f}")

# Usage example
model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10)
)

monitor = TrainingMonitor()

# Register forward hooks on all linear layers
for name, layer in model.named_modules():
    if isinstance(layer, nn.Linear):
        layer.register_forward_hook(monitor.forward_hook)

# During training, statistics are automatically collected
# After training, view the summary
# monitor.print_summary()

Important Notes

Hook Return Values

If a hook doesn’t need to modify data, it should return None to avoid unnecessary side effects. When modification is needed, return a tensor or tuple with the same structure as the input.
Multi-Input/Multi-Output Module Handling

For modules with multiple inputs or outputs, hooks receive tuples containing all inputs/outputs:
- Multi-input modules: input and grad_input are tuples containing all input tensors/gradients
- Multi-output modules: output and grad_output are tuples containing all output tensors/gradients
- Important:
  - For multi-output modules, backward pre-hooks are called only when all output gradients (outputs with requires_grad=True that participate in loss computation) are ready
  - For multi-input modules, backward hooks are called only when all input gradients (inputs with requires_grad=True) are ready
  This ensures the hook receives complete gradient information
When modifying gradients, always return a complete tuple with the same structure, even if you only modify some elements.
Multiple Hooks Chain

A module can register multiple hooks of the same type, which form a call chain in registration order:
- Forward Pre-Hook Chain: The return value of the previous hook becomes the input to the next hook
  - If a hook returns None: The original input is passed to the next hook
  - If a hook returns non-None: That return value is used as the next hook’s input
  - The last hook’s output determines the final input passed to forward
- Forward Hook Chain: The return value of the previous hook becomes the output to the next hook
  - If a hook returns None: The original output is passed to the next hook
  - If a hook returns non-None: That return value is used as the next hook’s output
  - The last hook’s output determines the module’s final return value
- Backward Pre-Hook Chain: Can modify grad_output
  - The grad_output received by a hook may be the modified version from the previous hook
  - If a hook returns None: The current grad_output is used to compute grad_input
  - If a hook returns non-None: That return value replaces grad_output and is used to compute grad_input
- Backward Hook Chain: Can modify grad_input
  - The grad_input received by a hook is the computed input gradient
  - The grad_output received by a hook may be the modified version from backward pre-hooks
  - If a hook returns None: The original grad_input is propagated to the previous layer
  - If a hook returns non-None: That return value replaces grad_input and is propagated to the previous layer
```
# Multi-output module example
class MultiOutputModule(nn.Module):
    def forward(self, x):
        return x * 2, x * 3  # Two outputs

module = MultiOutputModule()

def grad_hook(module, grad_input, grad_output):
    # grad_output contains gradients for BOTH outputs
    # This hook is called only when both output gradients are ready
    print(f"Output 1 grad shape: {grad_output[0].shape}")
    print(f"Output 2 grad shape: {grad_output[1].shape}")
    return None

module.register_full_backward_hook(grad_hook)
```

Gradient Computation Flow

Understanding the gradient computation flow helps correctly use backward hooks:
- Backward pre-hooks (register_full_backward_pre_hook): Called before grad_input computation. Modifying grad_output affects how gradients are computed for module inputs
- Backward hooks (register_full_backward_hook): Called after grad_input computation. Modifying grad_input affects gradients propagated to previous layers
```
Backward propagation flow:

1. Output gradients arrive from upstream
2. register_full_backward_pre_hook called (can modify grad_output)
3. Compute grad_input using (possibly modified) grad_output
4. register_full_backward_hook called (can modify grad_input)
5. Modified grad_input propagated to previous layers
```
Hook Execution Conditions

Backward hooks have specific execution conditions to ensure meaningful gradient modification:
- At least one module input must require gradients, OR
- The module must have parameters that require gradients
If neither condition is met, backward hooks won’t be called because there’s no gradient to modify.
Performance Considerations
- Hooks add extra function call overhead. For production inference, remove all debugging and monitoring hooks
- Avoid time-consuming operations in hooks, especially in training loops
- When multiple hooks are registered on the same module, they execute sequentially, compounding the overhead
Memory Management
- Be careful about memory leaks when saving tensor references in hooks. Saved tensors retain the computation graph
- Always use .clone() or .detach() to create copies when storing tensors for later analysis
- Cached gradients are automatically cleaned up after backward propagation completes

Interaction with Computational Graph

When modifying gradients in hooks, be aware of the computational graph:

Modified gradients flow into subsequent computations
For gradient clipping, ensure the operation doesn’t break gradient flow
For gradient monitoring, use .detach() to avoid affecting the graph

# Safe gradient clipping (preserves gradient flow)
def safe_clip_hook(module, grad_output):
    clipped = tuple(
        g.clip(-1, 1) if g is not None else None
        for g in grad_output
    )
    return clipped

# Safe gradient monitoring (doesn't affect graph)
def safe_monitor_hook(module, grad_input, grad_output):
    # Detach before storing to avoid memory leak
    stored_grads = [g.detach().clone() if g is not None else None
                   for g in grad_output]
    # ... analyze stored_grads ...
    return None

Convolutional Networks

Convolutional Neural Networks (CNNs) are one of the most important and widely used architectures in deep learning, particularly suitable for processing grid-structured data such as images, videos, and sequential data. Riemann provides a complete set of convolutional network components, including 1D, 2D, and 3D convolution layers and pooling layers.

Convolution Layers

Convolution layers extract local feature patterns by sliding learnable convolutional kernels over input data. Riemann supports three dimensions of convolution operations:

Convolution Layer Types
Convolution Layer	Applicable Data Types	Typical Application Scenarios
`Conv1d`	1D sequential data (N, C, L)	Audio processing, text sequences, time series
`Conv2d`	2D image data (N, C, H, W)	Image classification, object detection, image segmentation
`Conv3d`	3D volumetric data (N, C, D, H, W)	Video analysis, medical imaging, 3D reconstruction

Conv1d Layer

Purpose:

Process 1D sequential data such as audio waveforms, text sequences, and time series
Capture local temporal dependencies and patterns
Used for n-gram feature extraction in natural language processing

Parameters:

in_channels: Number of input channels
out_channels: Number of output channels (number of convolutional kernels)
kernel_size: Size of the convolutional kernel
stride: Convolution stride, default 1
padding: Padding size, default 0
dilation: Dilation rate, default 1
groups: Number of groups, default 1 (standard convolution)
bias: Whether to use bias, default True

Usage Example:

import riemann as rm
import riemann.nn as nn

# Audio signal processing
conv1d = nn.Conv1d(in_channels=1, out_channels=16, kernel_size=3, stride=1, padding=1)
audio = rm.randn(8, 1, 1000)  # batch=8, channels=1, samples=1000
output = conv1d(audio)
print(output.shape)  # [8, 16, 1000]

Conv2d Layer

Purpose:

Core component of CNN architecture for extracting local image features
Hierarchical feature extraction from low-level edge features to high-level semantic features
Supports standard convolution, grouped convolution, depthwise separable convolution, etc.

Parameters:

in_channels: Number of input channels (e.g., 3 for RGB images)
out_channels: Number of output channels
kernel_size: Convolutional kernel size (integer or tuple)
stride: Convolution stride, default 1
padding: Padding size, default 0
dilation: Dilation rate for increasing receptive field, default 1
groups: Number of groups, default 1
bias: Whether to use bias, default True

Usage Example:

import riemann as rm
import riemann.nn as nn

# Standard image convolution
conv2d = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=1, padding=1)
image = rm.randn(4, 3, 224, 224)  # batch=4, RGB, height=224, width=224
output = conv2d(image)
print(output.shape)  # [4, 64, 224, 224]

Conv3d Layer

Purpose:

Process 3D data such as videos and medical images (MRI, CT)
Capture spatiotemporal features or 3D spatial features
Simultaneously capture temporal and spatial correlations in video analysis

Parameters:

in_channels: Number of input channels
out_channels: Number of output channels
kernel_size: Convolutional kernel size (integer or triple tuple)
stride: Convolution stride, default 1
padding: Padding size, default 0
dilation: Dilation rate, default 1
groups: Number of groups, default 1
bias: Whether to use bias, default True

Usage Example:

import riemann as rm
import riemann.nn as nn

# Video data processing
conv3d = nn.Conv3d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
video = rm.randn(2, 3, 10, 64, 64)  # batch=2, RGB, frames=10, height=64, width=64
output = conv3d(video)
print(output.shape)  # [2, 16, 10, 64, 64]

Pooling Layers

Pooling layers are used to reduce the spatial dimensions of feature maps, decrease computational cost, and provide translation invariance. Riemann provides three types of pooling operations: max pooling, average pooling, and adaptive pooling.

Pooling Layer Types
Pooling Layer	Operation Type	Characteristics
`MaxPool1d/2d/3d`	Maximum value in window	Preserves salient features, robust to noise
`AvgPool1d/2d/3d`	Average value in window	Smooth downsampling, preserves overall information
`AdaptiveMaxPool1d/2d/3d`	Adaptive max pooling	Auto-computes pooling parameters, fixed output size
`AdaptiveAvgPool1d/2d/3d`	Adaptive average pooling	Auto-computes pooling parameters, fixed output size

Max Pooling Layers

Max pooling layers select the maximum value within the pooling window, preserving the most salient features and providing robustness to noise. Riemann provides both standard max pooling and adaptive max pooling.

Standard Max Pooling

MaxPool1d Layer

Purpose:

Apply 1D max pooling to sequence data, selecting the maximum value within the sliding window
Reduce sequence dimensionality while preserving the most salient features
Provide translation invariance for time series and sequential data

Parameters:

kernel_size: Pooling window size
stride: Pooling stride, defaults to kernel_size
padding: Padding size, default 0
dilation: Dilation rate, default 1
ceil_mode: Whether to use ceiling for output length calculation, default False
return_indices: Whether to return the indices of maximum values, default False

Usage Example:

import riemann as rm
import riemann.nn as nn

# Sequence data downsampling
maxpool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)
features = rm.randn(4, 16, 100)  # batch=4, channels=16, length=100
output = maxpool(features)
print(output.shape)  # [4, 16, 50]

MaxPool2d Layer

Purpose:

Preserve the most salient features by selecting the maximum value in local regions
Provide translation invariance
Significantly reduce spatial dimensions and subsequent layer computational complexity

Parameters:

kernel_size: Pooling window size
stride: Pooling stride, defaults to kernel_size
padding: Padding size, default 0
dilation: Dilation rate, default 1
ceil_mode: Whether to use ceiling for output size calculation, default False
return_indices: Whether to return the indices of maximum values, default False

Usage Example:

import riemann as rm
import riemann.nn as nn

# Standard image downsampling
maxpool = nn.MaxPool2d(kernel_size=2, stride=2)
features = rm.randn(4, 64, 224, 224)
output = maxpool(features)
print(output.shape)  # [4, 64, 112, 112]

MaxPool3d Layer

Purpose:

Apply 3D max pooling for volumetric data such as video and medical images
Reduce 3D spatial dimensions while preserving the most salient spatiotemporal features
Provide 3D translation invariance

Parameters:

kernel_size: Pooling window size (can be int or tuple of depth, height, width)
stride: Pooling stride, defaults to kernel_size
padding: Padding size, default 0
dilation: Dilation rate, default 1
ceil_mode: Whether to use ceiling for output size calculation, default False
return_indices: Whether to return the indices of maximum values, default False

Usage Example:

import riemann as rm
import riemann.nn as nn

# Video data downsampling
maxpool = nn.MaxPool3d(kernel_size=2, stride=2)
features = rm.randn(4, 3, 16, 64, 64)  # batch=4, channels=3, frames=16, height=64, width=64
output = maxpool(features)
print(output.shape)  # [4, 3, 8, 32, 32]

Adaptive Max Pooling

Adaptive pooling layers automatically compute the pooling kernel size and stride based on the specified output size, ensuring the output dimensions are always fixed without manual calculation of pooling parameters.

AdaptiveMaxPool1d Layer

Purpose:

Apply 1D adaptive max pooling to sequence data
Preserve the most salient features in sequences while mapping to fixed length
Suitable for sequence tasks requiring preservation of maximum value information

Parameters:

output_size: Output sequence length
return_indices: Whether to return the indices of maximum values, default False

Usage Example:

import riemann as rm
import riemann.nn as nn

# Adaptive max pooling
adaptive_pool = nn.AdaptiveMaxPool1d(output_size=10)
features = rm.randn(4, 16, 50)
output = adaptive_pool(features)
print(output.shape)  # [4, 16, 10]

AdaptiveMaxPool2d Layer

Purpose:

Apply 2D adaptive max pooling to image data
Preserve the most salient features in local regions
Suitable for vision tasks requiring preservation of spatial maximum value information

Parameters:

output_size: Output size, can be an integer tuple (H, W) or a single integer
return_indices: Whether to return the indices of maximum values, default False

Usage Example:

import riemann as rm
import riemann.nn as nn

# Adaptive max pooling
adaptive_pool = nn.AdaptiveMaxPool2d(output_size=(7, 7))
features = rm.randn(4, 64, 224, 224)
output = adaptive_pool(features)
print(output.shape)  # [4, 64, 7, 7]

AdaptiveMaxPool3d Layer

Purpose:

Apply 3D adaptive max pooling to 3D data
Preserve the most salient features in 3D space
Suitable for video analysis, medical images, and other 3D data processing

Parameters:

output_size: Output size, can be an integer tuple (D, H, W) or a single integer
return_indices: Whether to return the indices of maximum values, default False

Usage Example:

import riemann as rm
import riemann.nn as nn

# 3D adaptive max pooling
adaptive_pool = nn.AdaptiveMaxPool3d(output_size=(4, 7, 7))
features = rm.randn(4, 32, 16, 64, 64)
output = adaptive_pool(features)
print(output.shape)  # [4, 32, 4, 7, 7]

Average Pooling Layers

Average pooling layers compute the average value within the pooling window, providing smooth downsampling and preserving overall statistical information. Riemann provides both standard average pooling and adaptive average pooling.

Standard Average Pooling

AvgPool1d Layer

Purpose:

Apply 1D average pooling to sequence data, computing the average within the sliding window
Provide smooth downsampling for sequential data
Preserve overall statistical information

Parameters:

kernel_size: Pooling window size
stride: Pooling stride, defaults to kernel_size
padding: Padding size, default 0
ceil_mode: Whether to use ceiling, default False
count_include_pad: Whether to include padding values in average calculation, default True
divisor_override: Custom divisor for average computation, default None

Usage Example:

import riemann as rm
import riemann.nn as nn

# Smooth sequence downsampling
avgpool = nn.AvgPool1d(kernel_size=3, stride=2, padding=1)
features = rm.randn(4, 16, 100)  # batch=4, channels=16, length=100
output = avgpool(features)
print(output.shape)  # [4, 16, 50]

AvgPool2d Layer

Purpose:

Provide smooth feature representation by computing the average of local regions
More robust to noise compared to max pooling
Preserve overall statistical information

Parameters:

kernel_size: Pooling window size
stride: Pooling stride, defaults to kernel_size
padding: Padding size, default 0
ceil_mode: Whether to use ceiling, default False
count_include_pad: Whether to include padding values in average calculation, default True

Usage Example:

import riemann as rm
import riemann.nn as nn

# Smooth downsampling
avgpool = nn.AvgPool2d(kernel_size=2, stride=2)
features = rm.randn(4, 64, 224, 224)
output = avgpool(features)
print(output.shape)  # [4, 64, 112, 112]

AvgPool3d Layer

Purpose:

Apply 3D average pooling for volumetric data such as video and medical images
Provide smooth 3D downsampling while preserving overall spatiotemporal information
More robust to noise compared to 3D max pooling

Parameters:

kernel_size: Pooling window size (can be int or tuple of depth, height, width)
stride: Pooling stride, defaults to kernel_size
padding: Padding size, default 0
ceil_mode: Whether to use ceiling, default False
count_include_pad: Whether to include padding values in average calculation, default True
divisor_override: Custom divisor for average computation, default None

Usage Example:

import riemann as rm
import riemann.nn as nn

# 3D data smooth downsampling
avgpool = nn.AvgPool3d(kernel_size=2, stride=2)
features = rm.randn(4, 32, 16, 64, 64)  # batch=4, channels=32, depth=16, height=64, width=64
output = avgpool(features)
print(output.shape)  # [4, 32, 8, 32, 32]

Adaptive Average Pooling

Adaptive pooling layers automatically compute the pooling kernel size and stride based on the specified output size, ensuring the output dimensions are always fixed without manual calculation of pooling parameters.

AdaptiveAvgPool1d Layer

Purpose:

Apply 1D adaptive average pooling to sequence data
Map sequences of arbitrary length to a specified fixed length
Commonly used in the output layer of sequence models to unify dimensions of different length sequences

Parameters:

output_size: Output sequence length, can be an integer or None (indicating maintaining original size)

Usage Example:

import riemann as rm
import riemann.nn as nn

# Map sequences of different lengths to fixed length 10
adaptive_pool = nn.AdaptiveAvgPool1d(output_size=10)

# Input sequence length 50
features = rm.randn(4, 16, 50)  # batch=4, channels=16, length=50
output = adaptive_pool(features)
print(output.shape)  # [4, 16, 10]

# Input sequence length 100, output still 10
features = rm.randn(4, 16, 100)
output = adaptive_pool(features)
print(output.shape)  # [4, 16, 10]

AdaptiveAvgPool2d Layer

Purpose:

Apply 2D adaptive average pooling to image data
Map feature maps of arbitrary sizes to a specified fixed size
Commonly used at the end of CNNs to convert image features of different sizes to fixed dimensions

Parameters:

output_size: Output size, can be an integer tuple (H, W) or a single integer (indicating square output)

Usage Example

MNIST Handwritten Digit Recognition Example

Below is a complete CNN model example for MNIST handwritten digit recognition, including full training and inference workflows:

import riemann as rm
import riemann.nn as nn
import riemann.optim as opt
from riemann.vision.datasets import MNIST
from riemann.vision import transforms
from riemann.utils.data import DataLoader
from riemann import cuda

class MNISTNet(nn.Module):
    """MNIST Handwritten Digit Recognition Network"""

    def __init__(self):
        super().__init__()
        # Feature extraction layers
        self.features = nn.Sequential(
            # First convolution: 1@28x28 -> 32@28x28
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),  # 32@14x14

            # Second convolution: 32@14x14 -> 64@14x14
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),  # 64@7x7
        )

        # Classifier
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 128),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(128, 10)
        )

        # Loss function
        self.loss_func = nn.CrossEntropyLoss()

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

    def train_step(self, inputs, targets):
        """Single training step"""
        outputs = self.forward(inputs)
        loss = self.loss_func(outputs, targets)
        self.optimizer.zero_grad(True)
        loss.backward()
        self.optimizer.step()
        return loss

    def evaluate(self, dataloader, device):
        """Evaluate model performance"""
        total_loss = 0
        correct = 0
        total = 0

        for batch in dataloader:
            img_tensors, target_tensors = batch
            # Move data to device
            img_tensors = img_tensors.to(device)
            target_tensors = target_tensors.to(device)

            outputs = self.forward(img_tensors)

            # Compute loss
            loss = self.loss_func(outputs, target_tensors)
            total_loss += loss.item()

            # Compute accuracy
            predicted = outputs.argmax(dim=1)
            total += target_tensors.size(0)
            correct += (predicted == target_tensors).sum().item()

        accuracy = correct / total
        avg_loss = total_loss / len(dataloader)
        return accuracy, avg_loss


def main():
    """Main function: complete training and inference workflow"""
    print("MNIST Handwritten Digit Recognition CNN Example")

    # Check CUDA availability
    CUDA_AVAILABLE = cuda.CUPY_AVAILABLE
    device = 'cuda' if CUDA_AVAILABLE else 'cpu'
    print(f"Using device: {device}")

    # 1. Data preparation
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))  # Mean and std for MNIST
    ])

    # Load datasets
    train_dataset = MNIST(root='./data', train=True, transform=transform)
    test_dataset = MNIST(root='./data', train=False, transform=transform)

    # Create data loaders (batch size 256 for better efficiency)
    train_loader = DataLoader(train_dataset, batch_size=256, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=256, shuffle=False)

    # 2. Create model and move to device
    model = MNISTNet()
    model.to(device)
    print(f"Model structure:\n{model}")

    # Initialize optimizer (after moving model to device)
    model.optimizer = opt.Adam(model.parameters(), lr=0.001)

    # 3. Train model
    num_epochs = 5
    print(f"\nStarting training for {num_epochs} epochs...")

    for epoch in range(num_epochs):
        # Training phase
        model.train()
        train_loss = 0
        for batch_idx, (images, labels) in enumerate(train_loader):
            # Move data to device
            images = images.to(device)
            labels = labels.to(device)

            loss = model.train_step(images, labels)
            train_loss += loss.item()

            if batch_idx % 50 == 0:
                print(f"Epoch [{epoch+1}/{num_epochs}], "
                      f"Batch [{batch_idx}/{len(train_loader)}], "
                      f"Loss: {loss.item():.4f}")

        # Evaluation phase
        model.eval()
        test_accuracy, test_loss = model.evaluate(test_loader, device)
        avg_train_loss = train_loss / len(train_loader)

        print(f"Epoch [{epoch+1}/{num_epochs}] completed: "
              f"Train Loss: {avg_train_loss:.4f}, "
              f"Test Loss: {test_loss:.4f}, "
              f"Test Accuracy: {test_accuracy*100:.2f}%")

    # 4. Inference demonstration
    print("\nInference demonstration:")
    model.eval()

    # Get a batch of test data
    test_images, test_labels = next(iter(test_loader))
    # Move data to device
    test_images = test_images.to(device)
    test_labels = test_labels.to(device)

    # Forward propagation
    with rm.no_grad():
        outputs = model(test_images[:5])
        predictions = outputs.argmax(dim=1)

    print(f"Predictions: {predictions.tolist()}")
    print(f"True labels: {test_labels[:5].tolist()}")
    print(f"Prediction accuracy: {(predictions == test_labels[:5]).sum().item() / 5 * 100:.2f}%")

if __name__ == "__main__":
    main()

CIFAR-10 Image Classification Example

Below is a complete CNN model example for CIFAR-10 image classification, including full training and inference workflows:

import riemann as rm
import riemann.nn as nn
import riemann.optim as opt
from riemann.vision.datasets import CIFAR10
from riemann.vision import transforms
from riemann.utils.data import DataLoader
from riemann import cuda

class CIFAR10Net(nn.Module):
    """CIFAR-10 Image Classification Network (Simplified)"""

    def __init__(self):
        super().__init__()
        # Feature extraction layers (simplified, fewer conv layers)
        self.features = nn.Sequential(
            # First layer: 3@32x32 -> 32@16x16
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Dropout(0.25),

            # Second layer: 32@16x16 -> 64@8x8
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Dropout(0.25),

            # Third layer: 64@8x8 -> 128@4x4
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Dropout(0.25),
        )

        # Classifier
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 4 * 4, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, 10)
        )

        # Loss function
        self.loss_func = nn.CrossEntropyLoss()

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

    def train_step(self, inputs, targets):
        """Single training step"""
        outputs = self.forward(inputs)
        loss = self.loss_func(outputs, targets)
        self.optimizer.zero_grad(True)
        loss.backward()
        self.optimizer.step()
        return loss

    def evaluate(self, dataloader, device):
        """Evaluate model performance"""
        total_loss = 0
        correct = 0
        total = 0

        for batch in dataloader:
            img_tensors, target_tensors = batch
            # Move data to device
            img_tensors = img_tensors.to(device)
            target_tensors = target_tensors.to(device)

            outputs = self.forward(img_tensors)

            # Compute loss
            loss = self.loss_func(outputs, target_tensors)
            total_loss += loss.item()

            # Compute accuracy
            predicted = outputs.argmax(dim=1)
            total += target_tensors.size(0)
            correct += (predicted == target_tensors).sum().item()

        accuracy = correct / total
        avg_loss = total_loss / len(dataloader)
        return accuracy, avg_loss


def main():
    """Main function: complete training and inference workflow"""
    print("CIFAR-10 Image Classification CNN Example")

    # Check CUDA availability
    CUDA_AVAILABLE = cuda.CUPY_AVAILABLE
    device = 'cuda' if CUDA_AVAILABLE else 'cpu'
    print(f"Using device: {device}")

    # 1. Data preparation
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))  # Normalize RGB channels
    ])

    # Load datasets
    train_dataset = CIFAR10(root='./data', train=True, transform=transform)
    test_dataset = CIFAR10(root='./data', train=False, transform=transform)

    # Create data loaders (batch size 512 for better efficiency)
    train_loader = DataLoader(train_dataset, batch_size=512, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=512, shuffle=False)

    # 2. Create model and move to device
    model = CIFAR10Net()
    model.to(device)
    print(f"Model structure:\n{model}")

    # Initialize optimizer (after moving model to device)
    model.optimizer = opt.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)

    # 3. Train model
    num_epochs = 5
    print(f"\nStarting training for {num_epochs} epochs...")

    best_accuracy = 0.0
    for epoch in range(num_epochs):
        # Training phase
        model.train()
        train_loss = 0
        for batch_idx, (images, labels) in enumerate(train_loader):
            # Move data to device
            images = images.to(device)
            labels = labels.to(device)

            loss = model.train_step(images, labels)
            train_loss += loss.item()

            if batch_idx % 50 == 0:
                print(f"Epoch [{epoch+1}/{num_epochs}], "
                      f"Batch [{batch_idx}/{len(train_loader)}], "
                      f"Loss: {loss.item():.4f}")

        # Evaluation phase
        model.eval()
        test_accuracy, test_loss = model.evaluate(test_loader, device)
        avg_train_loss = train_loss / len(train_loader)

        print(f"Epoch [{epoch+1}/{num_epochs}] completed: "
              f"Train Loss: {avg_train_loss:.4f}, "
              f"Test Loss: {test_loss:.4f}, "
              f"Test Accuracy: {test_accuracy*100:.2f}%")

        # Save best model
        if test_accuracy > best_accuracy:
            best_accuracy = test_accuracy
            print(f"  -> Best model updated! Accuracy: {best_accuracy*100:.2f}%")

    print(f"\nTraining completed! Best test accuracy: {best_accuracy*100:.2f}%")

    # 4. Inference demonstration
    print("\nInference demonstration:")
    model.eval()

    # Get a batch of test data
    test_images, test_labels = next(iter(test_loader))
    # Move data to device
    test_images = test_images.to(device)
    test_labels = test_labels.to(device)

    # Class names
    classes = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

    # Forward propagation
    with rm.no_grad():
        outputs = model(test_images[:5])
        predictions = outputs.argmax(dim=1)

    print(f"Predicted classes: {[classes[p] for p in predictions.tolist()]}")
    print(f"True classes: {[classes[t] for t in test_labels[:5].tolist()]}")
    print(f"Prediction accuracy: {(predictions == test_labels[:5]).sum().item() / 5 * 100:.2f}%")

if __name__ == "__main__":
    main()

CNN Design Guidelines

Receptive Field Design:
- Stack multiple small convolutional kernels (e.g., 3x3) instead of large kernels to reduce parameters while maintaining the same receptive field
- Use dilated convolution to increase receptive field without adding parameters
Downsampling Strategy:
- Use pooling layers (MaxPool/AvgPool) or convolutions with stride > 1 for downsampling
- Downsampling gradually reduces feature map size and increases feature channels to extract higher-level features
Normalization and Regularization:
- Using BatchNorm after convolutional layers can accelerate training and improve model stability
- Use Dropout to prevent overfitting
Activation Function Selection:
- ReLU is the most commonly used activation function, computationally simple and effective for mitigating vanishing gradients
- LeakyReLU or GELU may perform better in deep networks

Transformer

Transformer is a deep learning architecture based on attention mechanisms, originally designed for natural language processing tasks but now widely applied in computer vision, speech recognition, and other fields. Riemann provides complete Transformer components compatible with PyTorch interfaces.

Transformer Architecture Overview

Transformer consists of two parts: Encoder and Decoder:

Encoder: Encodes the input sequence into a continuous representation (memory)
Decoder: Generates output sequences autoregressively based on the encoder’s output and previously generated target sequences

Input Sequence → [Encoder] → Memory → [Decoder] → Output Sequence
                      ↑_____________↓
                       Cross Attention

MultiheadAttention Mechanism

Multi-head attention is the core component of Transformer, allowing the model to simultaneously attend to information from different representation subspaces.

Principle:

Multi-head attention projects Query, Key, and Value inputs into multiple subspaces (heads), computes attention independently in each subspace, then concatenates the results and projects again:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) @ W^O
where head_i = Attention(Q @ W_i^Q, K @ W_i^K, V @ W_i^V)

Attention(Q, K, V) = softmax(Q @ K^T / √d_k) @ V

Purpose:

Capture dependencies between different positions in sequences
Self-attention mechanism allows each position to attend to all positions in the sequence
Multi-head design enables the model to attend to different types of information

Parameters:

embed_dim: Input and output dimension
num_heads: Number of attention heads, must divide embed_dim evenly
dropout: Dropout probability for attention weights, default 0.0
bias: Whether to use bias, default True
add_bias_kv: Whether to add learnable bias to key and value, default False
add_zero_attn: Whether to add zero vectors at the end of key and value sequences, default False
kdim: Key dimension, default None (uses embed_dim)
vdim: Value dimension, default None (uses embed_dim)
batch_first: Whether input format is (batch, seq, feature), default False

Forward Parameters:

query, key, value: Input tensors, shape depends on batch_first
attn_mask: Attention mask, supports 2D (tgt_len, src_len) or 3D (batch*num_heads, tgt_len, src_len)
key_padding_mask: Key padding mask, supports bool or float type, shape (batch, src_len)
is_causal: Whether to use causal mask (prevent attending to future positions), default False
need_weights: Whether to return attention weights, default True
average_attn_weights: Whether to average attention weights across heads, default True

Usage Example:

import riemann as rm
import riemann.nn as nn

# Create multi-head attention layer
mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

# Input tensors
batch_size, seq_len, embed_dim = 2, 10, 512
query = rm.randn(batch_size, seq_len, embed_dim)
key = rm.randn(batch_size, seq_len, embed_dim)
value = rm.randn(batch_size, seq_len, embed_dim)

# Forward propagation
output, attn_weights = mha(query, key, value)
print(f"Output shape: {output.shape}")  # [2, 10, 512]
print(f"Attention weights shape: {attn_weights.shape}")  # [2, 8, 10, 10] (batch, num_heads, tgt_len, src_len)

Using Masks Example:

import riemann as rm
import riemann.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

batch_size, seq_len, embed_dim = 2, 10, 512
query = rm.randn(batch_size, seq_len, embed_dim)
key = rm.randn(batch_size, seq_len, embed_dim)
value = rm.randn(batch_size, seq_len, embed_dim)

# Causal mask (for autoregressive models)
output, attn_weights = mha(query, key, value, is_causal=True)

# Custom attention mask (2D)
attn_mask = rm.zeros(seq_len, seq_len)
attn_mask[0, 5:] = float('-inf')  # Position 0 cannot attend to positions 5 and beyond
output, _ = mha(query, key, value, attn_mask=attn_mask)

# Key padding mask (ignore padding positions)
key_padding_mask = rm.zeros(batch_size, seq_len)
key_padding_mask[0, 8:] = float('-inf')  # Positions 8+ in sample 0 are padding
output, _ = mha(query, key, value, key_padding_mask=key_padding_mask)

Transformer Encoder

The encoder consists of multiple identical encoder layers stacked together. Each encoder layer contains:

Multi-head self-attention: Processes relationships within the input sequence
Feed-forward network: Applies non-linear transformations independently to each position
Residual connections and layer normalization: Stabilizes training

Two Normalization Modes:

Post-LN (default): Execute sublayer first, then normalize (original Transformer paper)
Pre-LN: Normalize first, then execute sublayer (more stable training)

Components:

TransformerEncoderLayer: Single encoder layer
TransformerEncoder: Complete encoder composed of N encoder layers

Usage Example:

import riemann as rm
import riemann.nn as nn

# Create encoder layer
encoder_layer = nn.TransformerEncoderLayer(
    d_model=512, nhead=8, dim_feedforward=2048,
    dropout=0.1, batch_first=True
)

# Create encoder (6 layers)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

# Input sequence (batch=2, seq_len=10, d_model=512)
src = rm.randn(2, 10, 512)

# Forward propagation
output = encoder(src)
print(f"Encoder output shape: {output.shape}")  # [2, 10, 512]

Transformer Decoder

The decoder consists of multiple identical decoder layers stacked together. Each decoder layer contains:

Masked multi-head self-attention: Prevents attending to future positions (autoregressive)
Cross-attention: Attends to encoder output (memory)
Feed-forward network: Non-linear transformation
Residual connections and layer normalization

Components:

TransformerDecoderLayer: Single decoder layer
TransformerDecoder: Complete decoder composed of N decoder layers

Usage Example:

import riemann as rm
import riemann.nn as nn

# Create decoder layer
decoder_layer = nn.TransformerDecoderLayer(
    d_model=512, nhead=8, dim_feedforward=2048,
    dropout=0.1, batch_first=True
)

# Create decoder (6 layers)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

# Target sequence (batch=2, tgt_len=20, d_model=512)
tgt = rm.randn(2, 20, 512)

# Encoder output (batch=2, src_len=10, d_model=512)
memory = rm.randn(2, 10, 512)

# Forward propagation
output = decoder(tgt, memory)
print(f"Decoder output shape: {output.shape}")  # [2, 20, 512]

Complete Transformer Model

Riemann provides a complete Transformer model containing both encoder and decoder.

Parameters:

d_model: Model dimension, default 512
nhead: Number of attention heads, default 8
num_encoder_layers: Number of encoder layers, default 6
num_decoder_layers: Number of decoder layers, default 6
dim_feedforward: Feed-forward network dimension, default 2048
dropout: Dropout probability, default 0.1
activation: Activation function, ‘relu’ or ‘gelu’, default ‘relu’
batch_first: Input format, default False

Usage Example:

import riemann as rm
import riemann.nn as nn

# Create Transformer model
transformer = nn.Transformer(
    d_model=512, nhead=8, num_encoder_layers=6, num_decoder_layers=6,
    dim_feedforward=2048, dropout=0.1, batch_first=True
)

# Source sequence (batch=2, src_len=10, d_model=512)
src = rm.randn(2, 10, 512)

# Target sequence (batch=2, tgt_len=20, d_model=512)
tgt = rm.randn(2, 20, 512)

# Forward propagation
output = transformer(src, tgt)
print(f"Transformer output shape: {output.shape}")  # [2, 20, 512]

Machine Translation Example

Below is a complete machine translation model example demonstrating the use of Transformer in training and inference:

import riemann as rm
import riemann.nn as nn

class TransformerTranslationModel(nn.Module):
    """Transformer Machine Translation Model"""

    def __init__(self, src_vocab_size, tgt_vocab_size, d_model=512, nhead=8,
                 num_encoder_layers=6, num_decoder_layers=6, max_seq_len=100):
        super().__init__()
        self.d_model = d_model

        # Word embedding layers
        self.src_embedding = nn.Embedding(src_vocab_size, d_model)
        self.tgt_embedding = nn.Embedding(tgt_vocab_size, d_model)

        # Positional encoding (simplified as learnable parameters)
        self.pos_encoding = nn.Embedding(max_seq_len, d_model)

        # Transformer
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_encoder_layers,
            num_decoder_layers=num_decoder_layers,
            dim_feedforward=2048, dropout=0.1,
            batch_first=True
        )

        # Output projection
        self.output_proj = nn.Linear(d_model, tgt_vocab_size)

    def forward(self, src, tgt, src_mask=None, tgt_mask=None):
        """Training forward propagation"""
        # Add positional encoding
        src_pos = rm.arange(src.shape[1]).expand(src.shape[0], -1)
        tgt_pos = rm.arange(tgt.shape[1]).expand(tgt.shape[0], -1)

        src_emb = self.src_embedding(src) + self.pos_encoding(src_pos)
        tgt_emb = self.tgt_embedding(tgt) + self.pos_encoding(tgt_pos)

        # Transformer forward propagation
        output = self.transformer(src_emb, tgt_emb, src_mask=src_mask, tgt_mask=tgt_mask)

        # Project to vocabulary dimension
        logits = self.output_proj(output)
        return logits

    def generate(self, src, max_len=50, start_token=1, end_token=2):
        """Inference: Autoregressive generation of translation results"""
        self.eval()

        # Encode source sequence
        src_pos = rm.arange(src.shape[1]).expand(src.shape[0], -1)
        src_emb = self.src_embedding(src) + self.pos_encoding(src_pos)
        memory = self.transformer.encoder(src_emb)

        # Autoregressive generation
        tgt = rm.full((src.shape[0], 1), start_token, dtype=rm.int64)

        for _ in range(max_len):
            # Generate causal mask (upper triangle is True, prevent attending to future positions)
            tgt_mask = rm.full((tgt.shape[1], tgt.shape[1]), float('-inf'))
            tgt_mask = tgt_mask.triu(diagonal=1)  # Upper triangle (excluding diagonal) set to -inf

            # Decode
            tgt_pos = rm.arange(tgt.shape[1]).expand(tgt.shape[0], -1)
            tgt_emb = self.tgt_embedding(tgt) + self.pos_encoding(tgt_pos)
            output = self.transformer.decoder(tgt_emb, memory, tgt_mask=tgt_mask)

            # Predict next token
            logits = self.output_proj(output[:, -1, :])
            next_token = logits.argmax(dim=-1, keepdim=True)

            # Add to sequence
            tgt = rm.concatenate([tgt, next_token], dim=1)

            # Check if end token is generated
            if (next_token == end_token).all():
                break

        return tgt

# Create model
model = TransformerTranslationModel(
    src_vocab_size=10000, tgt_vocab_size=10000,
    d_model=512, nhead=8, num_encoder_layers=6
)

# Simulate training data
src = rm.randint(0, 10000, (2, 20))  # Source sequence
tgt = rm.randint(0, 10000, (2, 25))  # Target sequence

# Training forward propagation
logits = model(src, tgt)
print(f"Training output shape: {logits.shape}")  # [2, 25, 10000]

# Inference generation
generated = model.generate(src, max_len=30)
print(f"Generated sequence shape: {generated.shape}")

Differences Between Encoder and Decoder

Encoder vs Decoder
Characteristic	Encoder	Decoder
Attention Type	Self-attention only	Self-attention + Cross-attention
Masking	No mask (can see all input)	Causal mask (cannot see future positions)
Input	Source sequence	Target sequence + Encoder output
Application Scenarios	Text classification, sentiment analysis, feature extraction	Machine translation, text generation, summarization

Training and Inference Workflow

Training Phase:

Encoder processes the entire source sequence to generate memory
Decoder receives the target sequence (teacher forcing) and encoder memory
Use causal mask to prevent decoder from attending to future positions
Compute loss and backpropagate

Inference Phase:

Encoder processes the source sequence to generate memory
Decoder generates autoregressively:
- Start with start token
- Generate next token based on previously generated tokens and encoder memory
- Repeat until end token or maximum length

Training:
Source: [I, love, you] ──→ Encoder ──→ Memory
Target: [我, 爱, 你] ───→ Decoder ──→ Output
                               ↑
                               └── Memory (from Encoder)

Inference:
Source: [I, love, you] ──→ Encoder ──→ Memory
                                          ↓
Generated: [我] ───────→ Decoder ──→ [爱]
   ↑                                      ↓
   └──────────────────────────────── [你]

Transformer Design Guidelines

Model Dimension Selection:
- d_model is typically 512 or 768, balancing model capacity and computational cost
- num_heads should divide d_model evenly (e.g., 512/8=64)
Layer Depth:
- Standard configuration is 6 encoder layers + 6 decoder layers
- Encoder-only models (e.g., BERT) can use 12-24 layers
- Decoder-only models (e.g., GPT) can use 12-96 layers
Positional Encoding:
- Essential for Transformer as it has no inherent sequential information
- Can use learnable positional embeddings or sinusoidal encoding
- Some modern variants use Rotary Position Embedding (RoPE)
Attention Mask Usage:
- src_mask: Used when source sequence contains padding
- tgt_mask: Causal mask to prevent attending to future positions
- memory_mask: Controls which encoder positions decoder can attend to
Optimization Tips:
- Use learning rate warmup to stabilize early training
- Label smoothing can improve generalization

KAN Networks

Kolmogorov-Arnold Networks (KAN) are a novel neural network architecture that uses learnable B-spline activation functions instead of traditional fixed activation functions. KAN is based on the Kolmogorov-Arnold representation theorem, which proves that any multivariate continuous function can be represented as a composition of univariate continuous functions.

KAN Network Principles

Core Idea

Traditional MLPs use fixed nonlinear activation functions (e.g., ReLU, Sigmoid):

\[\text{MLP: } x \mapsto \sum_i w_i \cdot \sigma(\text{activation}(x))\]

KAN uses learnable univariate functions to replace fixed activations:

\[\text{KAN: } x \mapsto \sum_i \phi_i(x_i) \cdot w_i\]

where \(\phi_i\) are learnable B-spline functions.

Dual-Path Computation

KANLinear layer contains two computation paths:

Base Function Path: Uses fixed activation functions (e.g., SiLU) to provide basic nonlinearity
Spline Path: Uses learnable B-spline functions to provide flexible nonlinear transformations

Input x
    │
    ├──→ [Base Path] ──→ SiLU(x) @ base_weight ────────┐
    │                                                  ├──→ Add ──→ Output
    └──→ [Spline Path] ─→ B-splines(x) @ spline_weight ┘

B-Spline Basis Functions

B-splines are piecewise polynomial functions with local support and smoothness. KAN uses the de Boor recursive formula to compute B-spline basis functions.

Order vs. Degree

B-spline “order” (k) is different from polynomial “degree”:

Order (k): The recursive order of B-spline, determines complexity
Degree: The highest power of the actual polynomial, equals order - 1

For example, 3rd-order B-spline corresponds to 2nd-degree (quadratic) polynomial with continuous first derivative.

de Boor Recursive Algorithm

de Boor algorithm is the standard method for computing B-spline basis functions:

0-order basis function (indicator function):

\[\begin{split}B_{i,0}(x) = \begin{cases} 1 & \text{if } t_i \leq x < t_{i+1} \\ 0 & \text{otherwise} \end{cases}\end{split}\]

k-order basis function (recursive definition):

\[B_{i,k}(x) = \frac{x - t_i}{t_{i+k} - t_i} B_{i,k-1}(x) + \frac{t_{i+k+1} - x}{t_{i+k+1} - t_{i+1}} B_{i+1,k-1}(x)\]

where \(t_i\) are the knot points.

Algorithm Properties:

Local Support: Each basis function is non-zero only in a limited interval
Partition of Unity: Sum of all basis functions equals 1 at any point
Continuity: k-order B-spline has \(C^{k-2}\) continuity

Interpretability of B-Spline Grid

B-spline grids have natural interpretability advantages:

Visual Understanding: Each basis function’s shape is visible and can be plotted
Local Control: Each grid interval corresponds to a local basis function
Smoothness Guarantee: Learned functions are naturally smooth
Symbolic Expression: B-splines can be converted to piecewise polynomial expressions

Adaptive Grid

KAN supports adaptive grid updates, dynamically adjusting grid point positions based on input data distribution to better fit the data.

Why Adaptive Grid is Needed

Fixed grids have problems with non-uniform data distributions:

Sparse Data Regions: Grid points too dense, wasting computation
Dense Data Regions: Grid points too sparse, insufficient fitting accuracy
Boundary Effects: Fixed grids may not cover actual data range

Adaptive Grid Algorithm

Riemann’s KAN implementation uses the following adaptive strategy:

Step 1: Compute Current Spline Output

splines = self.b_splines(x)
unreduced_spline_output = splines @ orig_coeff

Step 2: Build Adaptive Grid

Based on actual data distribution:

\[\text{grid}_{\text{adaptive}} = \text{sorted_data}\left[\text{linspace}(0, N-1, \text{grid_size}+1)\right]\]

Step 3: Build Uniform Grid

Covering data range with uniform spacing:

\[\text{grid}_{\text{uniform}} = \text{linspace}(\min(x)-\epsilon, \max(x)+\epsilon, \text{grid_size}+1)\]

Step 4: Mix Grids

Combining adaptive and uniform grids using grid_eps:

\[\text{grid} = \text{grid_eps} \cdot \text{grid}_{\text{uniform}} + (1 - \text{grid_eps}) \cdot \text{grid}_{\text{adaptive}}\]

Step 5: Extend Boundaries

Adding extra nodes at both ends:

grid = concatenate([
    grid[:1] - step * arange(spline_order, 0, -1),
    grid,
    grid[-1:] + step * arange(1, spline_order + 1)
])

Step 6: Update Spline Coefficients

Using least squares to map old grid to new grid:

self.spline_weight.data = self.curve2coeff(x, unreduced_spline_output).data

Algorithm Advantages:

Data-Driven: Grid automatically adapts to data distribution
Smooth Transition: Mixing strategy avoids training instability
Computationally Efficient: Only updates when necessary (e.g., every 20 epochs)

KAN Network Application Scenarios

KAN networks are particularly suitable for the following scenarios:

1. Tasks Requiring High Interpretability

Scientific computing and physical modeling
Tasks requiring understanding of feature importance
Explainable AI (XAI) applications

2. Function Fitting and Symbolic Regression

Discovering mathematical formulas behind data
Physical law discovery
Equation fitting

3. Few-Shot Learning

Higher parameter efficiency
Better performance with limited data
Avoids overfitting

4. Tasks Requiring Smooth Outputs

B-splines provide smooth function approximation
Suitable for applications requiring continuous derivatives
Physical simulation and control systems

5. Comparison with Traditional MLPs

Feature	MLP	KAN
Activation	Fixed	Learnable (B-spline)
Interpretability	Low	High
Parameter Efficiency	Average	High
Training Speed	Fast	Slower
Use Cases	General	Scientific, XAI

Riemann’s KAN Module

Riemann provides a complete KAN implementation in the riemann.nn.kan module:

Main Components:

KANLinear: KAN linear layer, the core building block
KAN: Multi-layer KAN network container

Features:

Efficient matrix multiplication implementation
Support for adaptive grid updates
L1 regularization and entropy regularization
Full compatibility with Riemann autograd

KANLinear Module

Structure Description

KANLinear is the basic building block of KAN networks, containing the following parameters:

Parameters:

in_features: Input feature dimension
out_features: Output feature dimension
grid_size: Grid size, controls the number of B-spline segments (default: 5)
spline_order: B-spline order, controls smoothness (default: 3)
scale_noise: Noise scaling coefficient for initialization (default: 0.1)
scale_base: Base function weight scaling coefficient (default: 1.0)
scale_spline: Spline weight scaling coefficient (default: 1.0)
enable_standalone_scale_spline: Whether to enable independent spline scaling (default: True)
base_activation: Base function activation function (default: SiLU)
grid_eps: Interpolation coefficient for grid updates (default: 0.02)
grid_range: Grid value range (default: [-1, 1])

Internal Parameters:

base_weight: Base function path weights, shape (out_features, in_features)
spline_weight: Spline path weights, shape (out_features, in_features, grid_size + spline_order)
spline_scaler: Spline scaling factor, shape (out_features, in_features) (optional)
grid: B-spline grid points, shape (in_features, grid_size + 2*spline_order + 1)

Usage Examples

Basic usage example:

import riemann as rm
from riemann.nn import KANLinear

# Create KAN linear layer
layer = KANLinear(
    in_features=10,
    out_features=5,
    grid_size=5,
    spline_order=3
)

# Input data
x = rm.randn(4, 10)  # (batch_size, in_features)

# Forward pass
output = layer(x)
print(f"Output shape: {output.shape}")  # (4, 5)

Training with adaptive grid updates:

# Update grid during training
for epoch in range(num_epochs):
    for batch in dataloader:
        x, y = batch

        # Update grid every few epochs
        if epoch % 20 == 0:
            layer.update_grid(x)

        output = layer(x)
        loss = criterion(output, y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

Using regularization:

# Compute regularization loss
reg_loss = layer.regularization_loss(
    regularize_activation=1.0,
    regularize_entropy=1.0
)

# Total loss = task loss + regularization loss
total_loss = task_loss + 0.01 * reg_loss

KAN Container Module

Structure Description

KAN is a container for multi-layer KAN networks, automatically stacking multiple KANLinear layers.

Parameters:

layers_hidden: List of hidden layer dimensions, e.g., [28*28, 64, 10]
grid_size: Grid size (default: 5)
spline_order: B-spline order (default: 3)
scale_noise: Noise scaling coefficient (default: 0.1)
scale_base: Base function weight scaling coefficient (default: 1.0)
scale_spline: Spline weight scaling coefficient (default: 1.0)
base_activation: Base function activation function (default: SiLU)
grid_eps: Grid update interpolation coefficient (default: 0.02)
grid_range: Grid value range (default: [-1, 1])

Usage Examples

Create multi-layer KAN network:

from riemann.nn import KAN

# Create multi-layer KAN network
model = KAN(
    layers_hidden=[784, 64, 32, 10],
    grid_size=5,
    spline_order=3
)

# Input data
x = rm.randn(4, 784)

# Forward pass
output = model(x)
print(f"Output shape: {output.shape}")  # (4, 10)

Update grid during training:

# Update grid during forward pass
output = model(x, update_grid=True)

Complete training example:

import riemann as rm
from riemann.nn import KAN
from riemann.optim import Adam

# Create model
model = KAN([784, 64, 10], grid_size=5, spline_order=3)
optimizer = Adam(model.parameters(), lr=0.001)
criterion = rm.nn.CrossEntropyLoss()

# Training loop
for epoch in range(100):
    for batch_idx, (data, target) in enumerate(train_loader):
        # Flatten images
        data = data.view(data.size(0), -1)

        # Update grid every 20 epochs
        update_grid = (epoch % 20 == 0 and batch_idx == 0)

        # Forward pass
        output = model(data, update_grid=update_grid)

        # Compute loss
        loss = criterion(output, target)

        # Add regularization
        reg_loss = model.regularization_loss()
        total_loss = loss + 0.01 * reg_loss

        # Backward pass
        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()

KAN Design Guidelines

Grid Size Selection:
- Small grid (3-5): Suitable for simple functions, fewer parameters
- Large grid (10-20): Suitable for complex functions, but more parameters
- Recommend starting with 5 and adjusting based on task
Spline Order Selection:
- 1st order: Linear splines, discontinuous derivatives
- 3rd order: Cubic splines, recommended default
- 5th order: Higher smoothness, but more computation
Grid Update Strategy:
- Update frequently in early training (every 10-20 epochs)
- Reduce update frequency in later training
- Avoid updating every batch (computational overhead)
Regularization Usage:
- L1 regularization promotes sparsity
- Entropy regularization promotes selectivity
- Regularization coefficient recommended 0.001-0.01
- Gradient clipping prevents gradient explosion