Automatic Differentiation Basics

Riemann’s automatic differentiation engine records tensor computations as they run, builds a computation graph from them, and computes derivatives efficiently through backpropagation. This is essential for training neural networks and for other optimization tasks.

Core Concepts

  • Computation Graph: A directed graph automatically constructed by Riemann in the background that records the relationships between tensor operations. Each node represents a tensor, and edges represent operations.

  • Forward Pass: The process of executing operations starting from input tensors along the computation graph to obtain the final output.

  • Backward Propagation (Backprop): The process of propagating gradients backward along the computation graph starting from the output tensor to compute derivatives for each input tensor.

  • Gradient: The partial derivative of a scalar output tensor with respect to other tensors, representing the rate of change of the output relative to the input.

  • Leaf Node Tensor: A tensor created directly by the user (e.g., through rm.tensor()) with requires_grad=True. These are typically model parameters.

  • Intermediate Node Tensor: A tensor created as a result of operations on other tensors. By default, gradients for intermediate nodes are not retained.
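
To make these terms concrete, here is a minimal sketch of reverse-mode differentiation in plain Python. It is not Riemann's implementation: the Var class and its value, parents, and grad fields are hypothetical names chosen for illustration, and only scalar addition and multiplication are supported.

```python
class Var:
    """Scalar node in a toy computation graph (hypothetical, illustration only)."""

    def __init__(self, value, parents=()):
        self.value = value      # result of the forward pass
        self.parents = parents  # pairs of (parent node, local derivative)
        self.grad = 0.0         # filled in by backward()

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def backward(self):
        # Order the nodes so that each one is processed after all of its consumers.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0  # d(output)/d(output)
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += node.grad * local  # chain rule


x, y = Var(2.0), Var(3.0)   # leaf nodes
z = x * y + x * x           # forward pass builds the graph
z.backward()                # backward pass propagates gradients to the leaves
print(x.grad, y.grad)       # 7.0 2.0
```

Here x and y play the role of leaf nodes, the products and the sum are intermediate nodes, and the backward loop applies the chain rule once per edge.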

Gradient Computation Methods

Riemann provides two methods for computing gradients:

  1. backward() method: Suitable for computing gradients of multiple tensors at once. After calling, gradients for all participating leaf node tensors are computed and stored in their respective grad attributes.

  2. grad() function: Suitable for computing gradients of specific tensors. Allows precise control over which tensors’ gradients to compute, returning a tuple of gradients without modifying the tensors’ grad attributes.

Gradient Tracking Switch

By default, tensors don’t track their gradients. To enable gradient tracking, set requires_grad=True when creating a tensor:

import riemann as rm

# Tensor without gradient tracking
x = rm.tensor([1., 2., 3.])
print(x.requires_grad)  # False

# Tensor with gradient tracking
x = rm.tensor([1., 2., 3.], requires_grad=True)
print(x.requires_grad)  # True

You can also enable or disable gradient tracking on existing tensors:

x = rm.tensor([1., 2., 3.])
print(x.requires_grad)  # False

# Enable gradient tracking
x.requires_grad_(True)
print(x.requires_grad)  # True

Computing Gradients

Riemann provides two methods for computing gradients: the backward() method and the grad() function.

Using the backward() Method

The backward() method is suitable for computing gradients of multiple tensors at once. After calling, gradients are automatically stored in the grad attributes of participating leaf node tensors.

Function Signature:

tensor_object.backward(gradient=None, retain_graph=False, create_graph=False)

Parameters:

  • gradient (optional): When the output tensor is not a scalar, a gradient tensor with the same shape as the output is required. For scalar outputs, this parameter can be omitted, defaulting to None (equivalent to passing scalar 1).

  • retain_graph (optional): Whether to retain the computation graph. Defaults to False, meaning the graph is released after backpropagation. Set to True if you need to call backward() multiple times.

  • create_graph (optional): Whether to record the computation graph of gradients for subsequent computation of higher-order derivatives, defaults to False.

Use Cases:

  • Training neural networks, computing gradients for all trainable parameters at once

  • When multiple backward passes are needed (e.g., gradient accumulation)

  • Computing higher-order derivatives

Important Notes:

  • Only leaf node tensors with requires_grad=True will have their gradients computed

  • Intermediate node tensors do not retain gradients by default; call retain_grad() if you need gradients for intermediate nodes

  • Gradients accumulate in the grad attribute; manually zero gradients before multiple backward() calls

Example 1: Gradient Computation for Scalar Output

import riemann as rm

# Create tensors with gradient tracking (leaf nodes)
x = rm.tensor(2.0, requires_grad=True)
y = rm.tensor(3.0, requires_grad=True)

# Define computation (intermediate node)
z = x * y + x ** 2.

# Compute gradients
z.backward()

# Access gradients
print(x.grad)  # dz/dx = y + 2*x = 3 + 4 = 7
print(y.grad)  # dz/dy = x = 2

Example 2: Gradient Computation for Non-Scalar Output

import riemann as rm

# Create tensors with gradient tracking
x = rm.tensor([1., 2., 3.], requires_grad=True)

# Define computation that produces a non-scalar output
y = x * 2.

# The output is not a scalar, so the gradient argument is required
gradient = rm.tensor([1., 1., 1.])  # Vector v in the vector-Jacobian product v^T J
y.backward(gradient)

# Access gradients
print(x.grad)  # [2., 2., 2.]
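
One way to read the gradient argument, assuming Riemann follows the usual reverse-mode convention: passing a vector v to backward() on a non-scalar y gives the same result as differentiating the scalar sum(v * y). The sketch below checks that reading in plain Python with central differences for y = 2 * x. The helper names weighted_sum and numeric_grad are made up for this illustration and are not part of Riemann.

```python
def f(x):
    """Non-scalar function from the example: y = 2 * x, elementwise."""
    return [2.0 * xi for xi in x]


def weighted_sum(x, v):
    """Scalar s = sum_i v[i] * f(x)[i]."""
    return sum(vi * yi for vi, yi in zip(v, f(x)))


def numeric_grad(x, v, eps=1e-6):
    """Central-difference gradient of the weighted sum with respect to x."""
    grad = []
    for i in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[i] += eps
        lo[i] -= eps
        grad.append((weighted_sum(hi, v) - weighted_sum(lo, v)) / (2 * eps))
    return grad


x = [1.0, 2.0, 3.0]
g_ones = numeric_grad(x, [1.0, 1.0, 1.0])
g_mixed = numeric_grad(x, [1.0, 0.5, 0.0])
print([round(g, 6) for g in g_ones])   # [2.0, 2.0, 2.0]
print([round(g, 6) for g in g_mixed])  # [2.0, 1.0, 0.0]
```

With all-ones weights the result matches the [2., 2., 2.] printed above. Other weights scale each output's contribution to the gradient.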

Example 3: Retaining Gradients for Intermediate Nodes

import riemann as rm

x = rm.tensor(2.0, requires_grad=True)
y = x * 3  # Intermediate node
z = y ** 2  # Output

# Retain gradients for intermediate node y
y.retain_grad()

z.backward()

print(x.grad)  # dz/dx = 36
print(y.grad)  # dz/dy = 12 (because retain_grad() was called)

Using the grad() Function

The grad() function is suitable for computing gradients of specific tensors, allowing precise control over which tensors’ gradients to compute.

Function Signature:

riemann.autograd.grad(outputs, inputs, grad_outputs=None, retain_graph=False, create_graph=False, allow_unused=False)

Parameters:

  • outputs: Output tensor(s) (scalar or tensor), the starting point for gradient computation

  • inputs: Input tensor or tuple of tensors, specifying which tensors to compute gradients for

  • grad_outputs (optional): Gradient tensor required when outputs is not a scalar

  • retain_graph (optional): Whether to retain the computation graph, defaults to False

  • create_graph (optional): Whether to record the computation graph of gradients for subsequent computation of higher-order derivatives, defaults to False

  • allow_unused (optional): Whether to allow some input tensors to be unused, defaults to False

Use Cases:

  • When you only need gradients for specific tensors, not all leaf nodes

  • When you don’t want to modify the grad attributes of tensors

  • When you need more flexible control over the gradient computation process

Important Notes:

  • Gradients are returned as a tuple, in the same order as the inputs parameter

  • Only tensors specified in inputs will have their gradients computed

  • Does not modify the grad attributes of input tensors

  • Intermediate nodes, even with retain_grad() called, will not automatically have gradients computed in grad(); they must be explicitly specified

Example 1: Computing Gradients for Specific Tensors

import riemann as rm

x = rm.tensor(2.0, requires_grad=True)
y = rm.tensor(3.0, requires_grad=True)
z = rm.tensor(4.0, requires_grad=True)

# Define computation
w = x * y + z

# Only compute gradients for x and y, not z
grads = rm.autograd.grad(w, (x, y))

print(grads)  # (tensor(3.), tensor(2.))
print(x.grad)  # None (grad() does not modify grad attributes)

Example 2: Gradient Computation for Non-Scalar Output

import riemann as rm

x = rm.tensor([1., 2., 3.], requires_grad=True)
y = x * 2

# For non-scalar outputs, grad_outputs must be provided
grad_outputs = rm.tensor([1., 1., 1.])
grads = rm.autograd.grad(y, x, grad_outputs=grad_outputs)

print(grads)  # (tensor([2., 2., 2.]),)

Gradient Accumulation

Gradients are accumulated by default. This means that if you call backward() multiple times, the gradients will add up:

import riemann as rm

# Create tensor with gradient tracking
x = rm.tensor(1.0, requires_grad=True)

# First computation
y = x * 2.
y.backward()
print(x.grad)  # 2

# Second computation
y = x * 3.
y.backward()
print(x.grad)  # 2 + 3 = 5 (gradients accumulate)

# Clear gradients
if x.grad is not None:
    x.grad.zero_()
print(x.grad)  # 0

Gradient Computation Context Control

Riemann provides functions and context managers for turning gradient tracking on or off within a given scope. This is useful during model inference, where gradients should be disabled to save memory, and during training, where gradients are needed.

is_grad_enabled() Function

The is_grad_enabled() function checks whether gradient computation is currently enabled.

import riemann as rm

# Check current gradient status
print(rm.is_grad_enabled())  # True (enabled by default)

with rm.no_grad():
    print(rm.is_grad_enabled())  # False

no_grad() Context Manager/Decorator

no_grad() temporarily disables gradient computation. Inside this context no computation tracks gradients, which suits inference and can reduce memory usage and speed up computation.

As a context manager:

import riemann as rm

x = rm.tensor([1., 2., 3.], requires_grad=True)

with rm.no_grad():
    y = x * 2.
    print(y.requires_grad)  # False

As a function decorator:

import riemann as rm

@rm.no_grad
def inference(model, x):
    # Computations within the function will not track gradients
    return model(x)

enable_grad() Context Manager/Decorator

enable_grad() temporarily enables gradient computation. It can be used to re-enable gradients inside a no_grad context.

As a context manager:

import riemann as rm

x = rm.tensor([1., 2., 3.], requires_grad=True)

with rm.no_grad():
    # Gradients are disabled here
    print(rm.is_grad_enabled())  # False

    with rm.enable_grad():
        # Gradients are temporarily enabled here
        y = x * 2.
        print(y.requires_grad)  # True

    # Back to disabled state
    print(rm.is_grad_enabled())  # False

As a function decorator:

import riemann as rm

@rm.enable_grad
def train_step(model, x, target, loss_fn):
    # Computations within the function will track gradients
    pred = model(x)
    loss = loss_fn(pred, target)
    loss.backward()
    return loss

set_grad_enabled() Context Manager/Decorator

set_grad_enabled(mode) is the most flexible gradient control function, allowing explicit enabling or disabling of gradient computation.

Parameters:

  • mode (bool): True to enable gradient computation, False to disable

As a context manager:

import riemann as rm

x = rm.tensor([1., 2., 3.], requires_grad=True)

# Disable gradients
with rm.set_grad_enabled(False):
    y = x * 2.
    print(y.requires_grad)  # False

# Enable gradients
with rm.set_grad_enabled(True):
    y = x * 2.
    print(y.requires_grad)  # True

As a function decorator:

import riemann as rm

@rm.set_grad_enabled(False)
def inference(model, x):
    return model(x)

@rm.set_grad_enabled(True)
def train(model, x, target, loss_fn):
    pred = model(x)
    loss = loss_fn(pred, target)
    loss.backward()
    return loss

Nested Context Managers

Gradient control context managers support nested usage, where inner contexts temporarily override outer settings:

import riemann as rm

x = rm.tensor([1., 2., 3.], requires_grad=True)

with rm.no_grad():  # Outer: disable gradients
    y1 = x * 2.
    print(f"Outer no_grad: y1.requires_grad = {y1.requires_grad}")  # False

    with rm.enable_grad():  # Inner: enable gradients
        y2 = x * 3.
        print(f"Inner enable_grad: y2.requires_grad = {y2.requires_grad}")  # True

    # Back to outer context
    y3 = x * 4.
    print(f"Back to outer: y3.requires_grad = {y3.requires_grad}")  # False
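
The save-and-restore behaviour that makes nesting work can be sketched in plain Python. This is an illustration under assumptions, not Riemann's actual code: the module-level flag _grad_enabled and the classes below are hypothetical stand-ins that only mimic the names used in this section.

```python
import contextlib

_grad_enabled = True  # hypothetical module-level flag, enabled by default


def is_grad_enabled():
    return _grad_enabled


class set_grad_enabled(contextlib.ContextDecorator):
    """Set the flag on entry and restore the previous value on exit."""

    def __init__(self, mode):
        self.mode = mode
        self._saved = []  # a stack, so one instance can be entered more than once

    def __enter__(self):
        global _grad_enabled
        self._saved.append(_grad_enabled)
        _grad_enabled = self.mode
        return self

    def __exit__(self, *exc):
        global _grad_enabled
        _grad_enabled = self._saved.pop()
        return False  # do not suppress exceptions


states = []
with set_grad_enabled(False):          # outer: disabled
    states.append(is_grad_enabled())
    with set_grad_enabled(True):       # inner: enabled
        states.append(is_grad_enabled())
    states.append(is_grad_enabled())   # inner exit restores the outer setting
states.append(is_grad_enabled())       # outer exit restores the default


@set_grad_enabled(False)               # the same class also works as a decorator
def inference():
    return is_grad_enabled()


print(states, inference())  # [False, True, False, True] False
```

Because each context restores whatever value it found on entry, an inner context can never leak its setting into the outer one, even if an exception is raised.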

Tensor Methods for Graph Detaching and Data Copying

Riemann provides several tensor methods for managing computation graph dependencies and copying tensor data. The methods differ in three respects:

  • Whether it creates a new tensor object or modifies in-place

  • Whether it shares data with the original tensor

  • Whether gradient tracking is preserved

Here are the key methods explained with individual examples:

  1. detach(): Create a new tensor that shares data with the original but is detached from the computation graph

The detach() method returns a new tensor object that shares the same data memory as the original tensor, but is disconnected from the computation graph. This means:

  • Changes to the detached tensor will modify the original tensor

  • No gradients will be backpropagated through the detached tensor

import riemann as rm

x = rm.tensor([1., 2., 3.], requires_grad=True)
y = x * 2.

# Detach y from the computation graph
detached_y = y.detach()

print(f"detached_y: {detached_y}")
print(f"detached_y.requires_grad: {detached_y.requires_grad}")
print(f"Modifying detached_y will modify y: {id(detached_y.data) == id(y.data)}")

Characteristics: Creates new tensor object, shares memory with original, disables gradient tracking

  2. detach_(): In-place operation that detaches the current tensor from the computation graph

The detach_() method is an in-place version of detach(). Instead of creating a new tensor, it modifies the current tensor to disconnect it from the computation graph.

import riemann as rm

x = rm.tensor([1., 2., 3.], requires_grad=True)
y = x * 2.

print(f"Before detach_(): y.requires_grad = {y.requires_grad}")
y.detach_()  # In-place operation
print(f"After detach_(): y.requires_grad = {y.requires_grad}")

Characteristics: Modifies tensor in-place (no new object), shares memory with original (same tensor), disables gradient tracking

  3. clone(): Create a new tensor with copied data that maintains computation graph dependencies

The clone() method creates a completely new tensor object with its own data memory, but preserves the computation graph dependencies from the original tensor. This means operations on the cloned tensor can backpropagate gradients to the original tensor.

import riemann as rm

x = rm.tensor([1., 2., 3.], requires_grad=True)
y = x * 2.

cloned_y = y.clone()

print(f"cloned_y: {cloned_y}")
print(f"cloned_y.requires_grad: {cloned_y.requires_grad}")
print(f"Modifying cloned_y won't modify y: {id(cloned_y.data) != id(y.data)}")

# Demonstrate gradient can propagate through cloned tensor to original tensor
loss = cloned_y.sum()
loss.backward()
print(f"x.grad after backward(): {x.grad}")  # Gradient propagates from cloned tensor to x

Characteristics: Creates new tensor object, copies data (no memory sharing), preserves gradient tracking

  4. copy(): Create a new tensor with copied data that is detached from the computation graph

The copy() method creates a new tensor object that has its own data memory and is completely detached from the computation graph. This is equivalent to calling clone().detach_() and is useful for creating independent tensor copies without gradient tracking.

import riemann as rm

x = rm.tensor([1., 2., 3.], requires_grad=True)
y = x * 2.

copied_y = y.copy()

print(f"copied_y: {copied_y}")
print(f"copied_y.requires_grad: {copied_y.requires_grad}")
print(f"Modifying copied_y won't modify y: {id(copied_y.data) != id(y.data)}")

Characteristics: Creates new tensor object, copies data (no memory sharing), disables gradient tracking

  5. Key Differences Between Methods

The following table summarizes the key differences between these four methods:

Method      Creates New Object?   Shares Memory with Original Tensor?   Supports Gradient Tracking?
detach()    Yes                   Yes                                   No
detach_()   No                    N/A (same tensor)                     No
clone()     Yes                   No                                    Yes
copy()      Yes                   No                                    No

import riemann as rm

x = rm.tensor([1., 2., 3.], requires_grad=True)

# Using detach() - creates new tensor, shares data, detached from graph
y1 = x.detach()
print(f"detach() result: y1 = {y1}, requires_grad={y1.requires_grad}")

# Using detach_() - in-place operation, modifies current tensor
x2 = rm.tensor([1., 2., 3.], requires_grad=True)
print(f"Before detach_(): x2.requires_grad={x2.requires_grad}")
x2.detach_()
print(f"After detach_(): x2.requires_grad={x2.requires_grad}")

# Using clone() - creates new tensor, copies data, maintains graph dependency
y2 = x.clone()
print(f"clone() result: y2 = {y2}, requires_grad={y2.requires_grad}")

# Using copy() - creates new tensor, copies data, detached from graph
y3 = x.copy()
print(f"copy() result: y3 = {y3}, requires_grad={y3.requires_grad}")

Key differences between these methods:

  • Data Sharing: detach() shares data with original, while clone() and copy() create new data copies

  • In-place Operation: detach_() modifies the tensor in-place, others create new tensors

  • Gradient Tracking: clone() maintains gradient tracking (if original requires it), others disable gradient tracking

  • Independent Copy: copy() creates a completely independent new tensor object that neither shares data with the original tensor nor preserves computational graph dependencies
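
The data-sharing half of this comparison can be seen with the Python standard library alone. The sketch below is only an analogy and does not involve Riemann or a computation graph: a memoryview stands in for a tensor that shares its buffer (as detach() is described to do), and a second array stands in for one that owns a copy (as clone() and copy() are described to do).

```python
from array import array

original = array("d", [1.0, 2.0, 3.0])

shared = memoryview(original)        # new object, same underlying buffer
independent = array("d", original)   # new object, new buffer with copied values

shared[0] = 100.0       # writes through to the original
independent[1] = -1.0   # leaves the original untouched

print(original.tolist())     # [100.0, 2.0, 3.0]
print(independent.tolist())  # [1.0, -1.0, 3.0]
```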

In-place Operations and Gradients

In-place operations can affect gradient computation. Here are important considerations:

  1. Leaf Variables with Gradient Tracking: In-place operations are NOT allowed on leaf tensors that require gradient tracking, as this would destroy the computational graph necessary for backpropagation.

  2. Non-Leaf Variables with Gradient Tracking: In-place operations are allowed on non-leaf tensors (intermediate results) that require gradient tracking.

Examples:

import riemann as rm

# 1. Example: In-place operations on leaf tensors are NOT allowed
x = rm.tensor([1., 2., 3.], requires_grad=True)  # Leaf tensor

try:
    x.add_(1.)  # This will raise an error
except RuntimeError as e:
    print(f"Error on leaf tensor in-place operation: {e}")

# 2. Example: In-place operations on non-leaf tensors ARE allowed
y = x * 2.  # Non-leaf tensor
print(f"Before in-place add on non-leaf tensor: y = {y}")
y.add_(3.)  # In-place operation on non-leaf tensor
print(f"After in-place add on non-leaf tensor: y = {y}")

# Compute gradient after in-place operation on non-leaf tensor
z = y.sum()
z.backward()
print(f"Gradient of x (leaf tensor): x.grad = {x.grad}")

# Clear gradients
x.grad.zero_()

# 3. Example: In-place assignment using tensor indexing on non-leaf tensors
y = x * 2.  # Non-leaf tensor
print(f"Before in-place indexing assignment: y = {y}")
y[0] = 100.  # In-place indexing assignment
print(f"After in-place indexing assignment: y = {y}")

# Compute gradient after indexing assignment
z = y.sum()
z.backward()
print(f"Gradient of x after indexing assignment: x.grad = {x.grad}")

# Clear gradients
x.grad.zero_()

# 4. Example: Gradient tracking with in-place operations
x = rm.tensor(2.0, requires_grad=True)  # Leaf tensor
y = rm.tensor(3.0, requires_grad=True)  # Leaf tensor

a = x * y  # Intermediate tensor
a.mul_(2.)  # In-place multiply
b = a + x  # Final tensor

b.backward()

print(f"Gradient of x: x.grad = {x.grad}")  # b = 2*x*y + x, so db/dx = 2*y + 1 = 7
print(f"Gradient of y: y.grad = {y.grad}")  # db/dy = 2*x = 4

Higher-Order Gradients

Riemann supports computing higher-order derivatives by setting create_graph=True:

import riemann as rm

# Create tensor with gradient tracking
x = rm.tensor(2.0, requires_grad=True)

# First-order computation
y = x ** 3.

# Compute first-order gradients with graph creation
dy_dx = rm.autograd.grad(y, x, create_graph=True)[0]
print(dy_dx)  # 12

# Compute second-order gradients
d2y_dx2 = rm.autograd.grad(dy_dx, x)[0]
print(d2y_dx2)  # 12

Additionally, Riemann provides two convenient tools for higher-order derivative computation: the d() method and higher_order_grad() function.

d() Method

The d() method of tensor objects computes mixed partial derivatives of the current scalar tensor with respect to one or more scalar tensors. Each argument adds one order of differentiation, so higher-order mixed derivatives can be requested in a single call.

import riemann as rm

# Create tensors with gradient tracking
x = rm.tensor(2.0, requires_grad=True)
y = rm.tensor(3.0, requires_grad=True)

# Define function f = x^3 * y^2
f = x ** 3 * y ** 2

# Compute mixed partial derivative d²f/dxdy
d2f_dxdy = f.d(x, y)
print(d2f_dxdy)  # 72.0

# Compute third-order mixed partial derivative d³f/dx²dy
d3f_dx2dy = f.d(x, x, y)
print(d3f_dx2dy)  # 72.0
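
The two values in the comments above can be checked by hand. Differentiating f = x^3 * y^2 step by step and evaluating at x = 2, y = 3:

```latex
\begin{aligned}
f(x, y) &= x^{3} y^{2} \\
\frac{\partial f}{\partial x} &= 3x^{2}y^{2} \\
\frac{\partial^{2} f}{\partial x\,\partial y} &= 6x^{2}y = 6 \cdot 2^{2} \cdot 3 = 72 \\
\frac{\partial^{2} f}{\partial x^{2}} &= 6xy^{2} \\
\frac{\partial^{3} f}{\partial x^{2}\,\partial y} &= 12xy = 12 \cdot 2 \cdot 3 = 72
\end{aligned}
```

Both derivatives equal 72 at this point only by coincidence of the chosen values. In general 6x²y and 12xy differ.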

higher_order_grad() Function

The higher_order_grad() function is used to compute n-th order derivatives of scalar tensor outputs with respect to input tensors. It provides a convenient way to directly compute derivatives of a specified order.

import riemann as rm

# Create tensor with gradient tracking
x = rm.tensor(2.0, requires_grad=True)

# Define function y = x^3
y = x ** 3

# Compute second-order derivative
d2y_dx2 = rm.autograd.higher_order_grad(y, x, 2)[0]
print(d2y_dx2)  # 12.0

# Compute third-order derivative
d3y_dx3 = rm.autograd.higher_order_grad(y, x, 3)[0]
print(d3y_dx3)  # 6.0

# Multiple inputs case
x1 = rm.tensor(1.0, requires_grad=True)
x2 = rm.tensor(2.0, requires_grad=True)
z = x1 ** 2 + x2 ** 3
grads = rm.autograd.higher_order_grad(z, [x1, x2], 2)
print(grads)  # (2.0, 12.0)
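
The derivative values quoted for y = x ** 3 in this section can be reproduced without any autograd library by differentiating the polynomial's coefficients directly. The helpers differentiate and evaluate are hypothetical names used only in this sketch.

```python
def differentiate(coeffs):
    """Differentiate c0 + c1*x + c2*x**2 + ... given as [c0, c1, c2, ...]."""
    return [k * c for k, c in enumerate(coeffs)][1:]


def evaluate(coeffs, x):
    """Evaluate the polynomial at x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))


poly = [0.0, 0.0, 0.0, 1.0]   # y = x**3
values = []
for order in range(1, 4):     # first, second, and third derivative at x = 2
    poly = differentiate(poly)
    values.append(evaluate(poly, 2.0))
print(values)  # [12.0, 12.0, 6.0]
```

The first and second derivatives (3x² and 6x) both happen to equal 12 at x = 2, and the third derivative is the constant 6.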

Gradient Functions (Functional API)

Riemann also provides a functional API in the riemann.autograd.functional module for computing more advanced derivative structures, such as Jacobian matrices, Hessian matrices, and Jacobian-vector products.

jacobian() Function

The jacobian() function computes the Jacobian matrix of a function from input to output, showing all first-order partial derivatives of the function output with respect to the input.

import riemann as rm

# Define function f = x^2
def f(x):
    return x ** 2

# Create input tensor
x = rm.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Compute Jacobian matrix
jac = rm.autograd.functional.jacobian(f, x)
print(jac)
print(jac.shape)  # (3, 3); for vector input and output, shape is (n_outputs, n_inputs)

hessian() Function

The hessian() function computes the Hessian matrix of a scalar-valued function, showing all second-order partial derivatives of the function with respect to its inputs.

import riemann as rm

# Define function f = x^3
def f(x):
    return x ** 3

# Create input tensor
x = rm.tensor(2.0, requires_grad=True)

# Compute Hessian matrix
hess = rm.autograd.functional.hessian(f, x)
print(hess)
print(hess.shape)  # (1, 1)  # For scalar input, shape is (input_size, input_size)

derivative() Function

The derivative() function computes a derivative function for the given function. It creates a new function that, when called, returns the derivative of the original function at the specified inputs.

import riemann as rm

# Define function f = x^2
def f(x):
    return x ** 2.

# Create derivative function
df = rm.autograd.functional.derivative(f)

# Test the derivative function
x = rm.tensor(2.0, requires_grad=True)
print(df(x))  # Should return tensor(4.0)

# Multi-input example
def g(x, y):
    return x * y + x ** 2.

dg = rm.autograd.functional.derivative(g)
x = rm.tensor(2.0, requires_grad=True)
y = rm.tensor(3.0, requires_grad=True)
print(dg(x, y))

jvp() (Jacobian-Vector Product) Function

The jvp() function computes the product of a Jacobian matrix with a given vector.

import riemann as rm

# Define function f = x^2
def f(x):
    return x ** 2

# Create input tensor
x = rm.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Define v vector
v = rm.tensor([1.0, 1.0, 1.0])

# Compute jvp
f_x, jvp_val = rm.autograd.functional.jvp(f, x, v)
print(jvp_val)

vjp() (Vector-Jacobian Product) Function

The vjp() function computes the product of a given vector with a Jacobian matrix.

import riemann as rm

# Define function f = x^2
def f(x):
    return x ** 2

# Create input tensor
x = rm.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Define v vector
v = rm.tensor([1.0, 1.0, 1.0])

# Compute vjp
f_x, vjp_val = rm.autograd.functional.vjp(f, x, v)
print(vjp_val)
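
For the elementwise x ** 2 used in the jvp() and vjp() examples the Jacobian is diagonal, so both products give the same numbers. The difference shows up with a non-square Jacobian. The sketch below uses plain Python lists and a made-up function from two inputs to three outputs. The helper names jacobian, jvp, and vjp here are local to the sketch and are not the Riemann functions.

```python
def jacobian(x):
    """Analytic Jacobian of f(x0, x1) = [x0 * x1, x0 + x1, x1 ** 2].

    One row per output, one column per input.
    """
    x0, x1 = x
    return [[x1, x0],
            [1.0, 1.0],
            [0.0, 2.0 * x1]]


def jvp(J, v):
    """J @ v: v has one entry per input, the result has one entry per output."""
    return [sum(Jij * vj for Jij, vj in zip(row, v)) for row in J]


def vjp(J, u):
    """u^T @ J: u has one entry per output, the result has one entry per input."""
    return [sum(ui * row[j] for ui, row in zip(u, J)) for j in range(len(J[0]))]


J = jacobian([2.0, 3.0])
print(jvp(J, [1.0, 0.0]))       # [3.0, 1.0, 0.0]
print(vjp(J, [1.0, 1.0, 1.0]))  # [4.0, 9.0]
```

A JVP pushes a perturbation of the inputs forward to the outputs, while a VJP pulls a weighting of the outputs back to the inputs, which is what backpropagation computes.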

hvp() (Hessian-Vector Product) and vhp() Functions

The hvp() and vhp() functions compute the Hessian-vector product (Hv) and the vector-Hessian product (vᵀH), respectively. Because the Hessian of a function with continuous second derivatives is symmetric, the two products contain the same values.

import riemann as rm

# Define scalar-valued function
def f(x):
    return (x ** 3).sum()

# Create input tensor
x = rm.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Define v vector
v = rm.tensor([1.0, 1.0, 1.0])

# Compute hvp
f_x, hvp_val = rm.autograd.functional.hvp(f, x, v)
print(hvp_val)

# vhp computes the same result as hvp
f_x, vhp_val = rm.autograd.functional.vhp(f, x, v)
print(vhp_val)
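
A Hessian-vector product never requires the full Hessian: it is the directional derivative of the gradient along v. The sketch below illustrates that idea in plain Python for f(x) = sum(x ** 3), approximating the directional derivative with central differences. It is a numerical check under that assumption, not how Riemann computes hvp().

```python
def grad_f(x):
    """Analytic gradient of f(x) = sum(x_i ** 3)."""
    return [3.0 * xi ** 2 for xi in x]


def hvp(x, v, eps=1e-4):
    """Approximate H @ v as the directional derivative of the gradient along v."""
    plus = grad_f([xi + eps * vi for xi, vi in zip(x, v)])
    minus = grad_f([xi - eps * vi for xi, vi in zip(x, v)])
    return [(p - m) / (2 * eps) for p, m in zip(plus, minus)]


x = [1.0, 2.0, 3.0]
v = [1.0, 1.0, 1.0]
approx = hvp(x, v)
exact = [6.0 * xi * vi for xi, vi in zip(x, v)]  # H = diag(6 * x) for this f
print([round(a, 6) for a in approx], exact)  # [6.0, 12.0, 18.0] [6.0, 12.0, 18.0]
```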

Custom Gradient Functions

Riemann provides three ways to implement custom functions with gradient tracking support:

  1. Using Riemann Tensor Functions (Automatic Gradients): If you implement your custom function using existing Riemann tensor functions, you get gradient tracking automatically without writing any gradient code:

    import riemann as rm
    
    def my_custom_function(x):
        """A custom function that automatically gets gradient support"""
        return rm.exp(rm.sin(x)) + x**2.
    
    # Test automatic gradient tracking
    x = rm.tensor(1.0, requires_grad=True)
    y = my_custom_function(x)
    y.backward()
    print(f"Gradient: {x.grad}")  # Will automatically compute correct gradient
    
  2. Using the track_grad Decorator: Use the track_grad decorator to wrap your function and provide explicit gradient computation.

    Gradient Function Interface Requirements:

    The gradient function passed to track_grad must follow these interface requirements:

    • Parameters: Must accept the same parameters as the forward function (same names and order)

    • Return Value: Must return a tuple containing the gradient (partial derivative) for each input tensor

    • Tuple Elements: Each element corresponds to the gradient of the respective input tensor. For tensors that don’t require gradients, return None for that position

    • Gradient Calculation: The gradient should be computed as the partial derivative of the output with respect to each input

    Example for single input:

    import riemann as rm
    import numpy as np
    
    def sigmoid_derivative(x):
        """Gradient function for sigmoid: returns tuple with one element"""
        sig = 1. / (1. + np.exp(-x.data))
        return (rm.tensor(sig * (1. - sig)),)  # Note: must return a tuple
    
    @rm.track_grad(sigmoid_derivative)
    def custom_sigmoid(x):
        """Custom sigmoid function with gradient support"""
        return rm.tensor(1. / (1. + np.exp(-x.data)))
    
    # Test custom sigmoid with gradient
    x = rm.tensor(0.0, requires_grad=True)
    y = custom_sigmoid(x)
    y.backward()
    print(f"Sigmoid output: {y}")  # Should be 0.5
    print(f"Sigmoid gradient: {x.grad}")  # Should be 0.25
    

    Example for multiple inputs:

    import riemann as rm
    
    def multiply_derivative(x, y):
        """Gradient function for multiplication: d(xy)/dx = y, d(xy)/dy = x"""
        return (y, x)  # Returns tuple with gradient for each input
    
    @rm.track_grad(multiply_derivative)
    def custom_multiply(x, y):
        """Custom multiplication function with gradient support"""
        return x * y
    
    # Test with multiple inputs
    x = rm.tensor(2.0, requires_grad=True)
    y = rm.tensor(3.0, requires_grad=True)
    z = custom_multiply(x, y)
    z.backward()
    print(f"z = {z}")  # Should be 6.0
    print(f"dz/dx = {x.grad}")  # Should be 3.0 (y)
    print(f"dz/dy = {y.grad}")  # Should be 2.0 (x)
    
  3. Using the Function Class: For more complex cases, you can subclass Function and implement both the forward and backward static methods.

    Function Class Interface:

    To create a custom function using the Function class, you must implement two static methods:

    forward(ctx, *inputs)

    • Purpose: Performs the forward computation

    • Parameters:

      • ctx: Context object used to save information for the backward pass. Use ctx.save_for_backward() to store tensors needed in backward

      • *inputs: Input tensors (variable number of arguments)

    • Returns: Output tensor(s) of the forward computation

    • Usage: Implement your custom computation logic here and save any tensors needed for gradient computation using ctx.save_for_backward()

    backward(ctx, grad_output)

    • Purpose: Performs the backward (gradient) computation

    • Parameters:

      • ctx: Context object containing information saved during forward pass. Access saved tensors via ctx.saved_tensors

      • grad_output: Gradient of the output tensor (from subsequent layers in the computation graph)

    • Returns: Tuple of gradients, one for each input tensor. Each gradient should be the product of grad_output and the local gradient (partial derivative)

    • Usage: Compute gradients using the chain rule: grad_input = grad_output * local_gradient

    Example:

    import riemann as rm
    import numpy as np
    
    class CustomSigmoid(rm.autograd.Function):
        @staticmethod
        def forward(ctx, x):
            """Forward computation for sigmoid
    
            Args:
                ctx: Context object for saving tensors
                x: Input tensor
    
            Returns:
                Output tensor after applying sigmoid
            """
            sig = 1. / (1. + np.exp(-x.data))
            ctx.save_for_backward(rm.tensor(sig))  # Save for backward
            return rm.tensor(sig)
    
        @staticmethod
        def backward(ctx, grad_output):
            """Backward computation for sigmoid
    
            Args:
                ctx: Context object with saved tensors
                grad_output: Gradient from output side
    
            Returns:
                Tuple with one gradient per input to forward
            """
            sig, = ctx.saved_tensors  # Retrieve saved tensor
            # Chain rule: grad_input = grad_output * local_gradient
            # local_gradient for sigmoid: sig * (1 - sig)
            return (grad_output * sig * (1. - sig),)  # One-element tuple: forward has one input
    
    # Test CustomSigmoid
    x = rm.tensor(0.0, requires_grad=True)
    y = CustomSigmoid.apply(x)  # Use apply() to call the function
    y.backward()
    print(f"Sigmoid output: {y}")  # Should be 0.5
    print(f"Sigmoid gradient: {x.grad}")  # Should be 0.25
    

    Key Points:

    • Always use @staticmethod decorator for both forward and backward methods

    • Use ctx.save_for_backward() in forward to save tensors needed for gradient computation

    • Access saved tensors in backward via ctx.saved_tensors (returns a tuple)

    • The backward method must return a tuple with one gradient for each input to forward

    • Call the function using ClassName.apply(*inputs), not by instantiating the class
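
A hand-written backward is easy to get subtly wrong, so it is worth comparing it against a numerical derivative before relying on it. The sketch below does this for the sigmoid chain rule in plain Python. The function names are hypothetical, and the check is independent of Riemann.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_backward(x, grad_output):
    """Chain rule: grad_input = grad_output * local_gradient."""
    s = sigmoid(x)
    return grad_output * s * (1.0 - s)


def numeric_backward(x, grad_output, eps=1e-6):
    """Central-difference reference for the same quantity."""
    return grad_output * (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)


for x0 in (-2.0, 0.0, 1.5):
    analytic = sigmoid_backward(x0, grad_output=1.0)
    numeric = numeric_backward(x0, grad_output=1.0)
    assert abs(analytic - numeric) < 1e-8, (x0, analytic, numeric)

print(sigmoid(0.0), sigmoid_backward(0.0, 1.0))  # 0.5 0.25
```

The same comparison applies to any custom gradient: evaluate the analytic formula and the finite-difference estimate at a few points and require them to agree within a small tolerance.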

Advanced Computational Graph Manipulation

Riemann provides functions for manually manipulating the computational graph. These functions are designed for special use cases where you need to connect tensors to the computational graph without affecting forward computation values or backward gradient values. These are low-level tools typically used in framework internals (such as Riemann’s hook handling mechanism) rather than common user scenarios.

share_grad_map Function

The share_grad_map function maps a group of tensors to a new group of the same size, fully connecting the two groups in the computational graph. Corresponding positions are linked by an identity mapping: the forward pass passes the tensor value through unchanged, and the backward pass passes the gradient through unchanged (equivalent to a clone relationship). All other connections carry zero: in the forward pass they contribute nothing to the new tensor values, and in the backward pass they propagate zero gradients.

Purpose: Ensure all tensors in a group participate in the computational graph and receive gradients (zero for tensors not directly involved in computation) rather than None. This is particularly useful when you want certain tensors to receive zero gradients without changing the existing computational graph’s forward or backward computation values.

Core Mechanism:

  1. For each tensor that requires gradients, create a clone. The cloned tensor depends on the original tensor through the clone operation (gradient passes through)

  2. Attach all other tensors (excluding itself) as zero-gradient sources to the cloned tensor

  3. This way, each cloned tensor keeps its gradient relationship with its original tensor while forming zero-gradient connections to the other tensors
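
The three steps above can be modeled with a toy scalar autodiff written in plain Python. This is only a conceptual sketch, not Riemann’s implementation: the Node class and the helpers mul, clone_with_zero_sources and share_grad_map_sketch are hypothetical names introduced for illustration. In this toy model the zero gradient lands on the original node; how Riemann stores gradients on the returned tensors may differ.

```python
class Node:
    """Toy scalar autodiff node (illustrative only, not Riemann's tensor class)."""

    def __init__(self, value, parents=()):
        self.value = value
        # Each parent is a (parent_node, local_derivative) pair.
        self.parents = tuple(parents)
        self.grad = None  # stays None until a gradient reaches this node

    def backward(self, upstream=1.0):
        # Accumulate the incoming gradient, then push it on to every parent.
        self.grad = upstream if self.grad is None else self.grad + upstream
        for parent, local in self.parents:
            parent.backward(upstream * local)


def mul(a, b):
    """Product node: d(a*b)/da = b and d(a*b)/db = a."""
    return Node(a.value * b.value, [(a, b.value), (b, a.value)])


def clone_with_zero_sources(x, others):
    """Steps 1 and 2: identity edge to x, zero-derivative edges to the others."""
    parents = [(x, 1.0)] + [(o, 0.0) for o in others]
    return Node(x.value, parents)


def share_grad_map_sketch(nodes):
    """Hypothetical stand-in for the mapping described above."""
    return tuple(
        clone_with_zero_sources(n, [o for o in nodes if o is not n])
        for n in nodes
    )


# Without the mapping: c is never used, so no gradient reaches it.
a, b, c = Node(2.0), Node(3.0), Node(5.0)
y1 = mul(a, b)
y1.backward()
grad_c_before = c.grad

# With the mapping: c is a zero-gradient source of the clones of a and b.
a, b, c = Node(2.0), Node(3.0), Node(5.0)
a2, b2, c2 = share_grad_map_sketch((a, b, c))
y2 = mul(a2, b2)
y2.backward()

print(grad_c_before, c.grad)       # None before, a zero gradient after
print(a.grad, b.grad)              # gradients of a and b are unchanged
print(y1.value == y2.value)        # forward value is unchanged
```

The zero-derivative edges are what turn a missing gradient (None) into an explicit zero, while the identity edge leaves the gradients of a and b exactly as they would be without the mapping.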

Parameters:

  • tensors: A tuple or list of tensors to connect. Must be a tuple or list (not a set) so that order is preserved.

Returns: A tuple or list of tensors with the same values but connected to a shared computational graph. Note: tensors with requires_grad=True are cloned (not modified in place).

Behavior:

  • Tensors with requires_grad=True are cloned, and all other tensors are attached as zero-gradient sources to the cloned tensor

  • Tensors without gradients or non-TN objects remain unchanged

  • All connected tensors receive zero gradients from each other

Example:

import riemann as rm

a = rm.tensor([1.0, 2.0], requires_grad=True)
b = rm.tensor([3.0, 4.0], requires_grad=True)
c = rm.tensor([5.0, 6.0], requires_grad=True)

# Define a function that only uses a and b
def func(a, b, c):
    return (a * b).sum()

# Before share_grad_map: c doesn't participate, receives None
y1 = func(a, b, c)
y1.backward()
print(f"c.grad = {c.grad}")  # Output: None

# Reset tensors
a = rm.tensor([1.0, 2.0], requires_grad=True)
b = rm.tensor([3.0, 4.0], requires_grad=True)
c = rm.tensor([5.0, 6.0], requires_grad=True)

# After share_grad_map: all tensors connected, c receives zero gradient
a_new, b_new, c_new = rm.share_grad_map((a, b, c))
y2 = func(a_new, b_new, c_new)
y2.backward()
print(f"c.grad = {c_new.grad}")  # Output: [0.0, 0.0]

# Verify: forward values are identical, a and b gradients unchanged
assert float(y1.data) == float(y2.data)
assert (a_new.grad == rm.tensor([3., 4.])).all()
assert (b_new.grad == rm.tensor([1., 2.])).all()

Typical Use Cases:

  1. Module Hook Handling: In Riemann’s module hook mechanism, share_grad_map is used to create new module output tensors to replace the original output tensors. When only some tensors of a module participate in or contribute to the loss function computation, share_grad_map produces a new module output without changing forward computation or backward gradient values. This ensures that output tensors that previously didn’t participate in loss computation can now receive zero gradients, and input tensors depending on these outputs will also receive zero gradients.

  2. Multi-task Learning: When some parameters don’t participate in a particular task’s loss computation but you want them to receive zero gradients rather than None, so that gradient accumulation works uniformly.

  3. Conditional Computation: When some tensors are used only conditionally in the forward pass but you want consistent gradient behavior regardless of the condition.

  4. Gradient Monitoring: When you want to monitor gradients of all parameters in a group, even those not directly involved in a specific computation.

Note: This is a low-level function for manually building computational graphs. Most users should rely on Riemann’s automatic graph construction rather than using this function directly.

Supporting Functions and Methods

The following functions and methods are used internally by share_grad_map and are rarely needed directly by users:

fwbw_all_zero Function

Returns a scalar tensor with value 0.0 in the forward pass and a zero tensor with the same shape as the input in the backward pass. Used to add a tensor to the computational graph without affecting forward or backward values.

attach_zero_grad_sources Method

Attaches multiple tensors as source tensors to a tensor. This doesn’t change the tensor’s value, but allows it to pass zero gradients to these sources during the backward pass. Used internally to connect tensors to the computational graph so they receive zero gradients instead of None.

Gradient Checking

Use the gradcheck function to verify your custom gradient functions are correct:

import riemann as rm

# Define a test function for gradcheck (assumes the CustomSigmoid class is already defined)
def test_function(x):
    return CustomSigmoid.apply(x)

# Perform gradient check
x = rm.tensor(0.0, requires_grad=True)
check_passed = rm.gradcheck(test_function, (x,))
print(f"Gradient check passed: {check_passed}")

gradcheck verifies that your analytical gradient matches a numerical gradient computed with the finite difference method.
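
The idea behind such a check can be sketched in plain Python for a scalar sigmoid. This is a minimal illustration of finite-difference checking, not Riemann’s gradcheck: the helper names numerical_grad and check_gradient, the step size eps and the tolerances atol and rtol are assumptions chosen for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Analytical derivative: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def numerical_grad(f, x, eps=1e-6):
    """Central finite difference approximation of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


def check_gradient(f, grad_f, x, eps=1e-6, atol=1e-5, rtol=1e-3):
    """Return True when analytical and numerical gradients agree within tolerance."""
    numeric = numerical_grad(f, x, eps)
    analytic = grad_f(x)
    return abs(numeric - analytic) <= atol + rtol * abs(analytic)


# A correct analytical gradient passes at several test points.
results = {x: check_gradient(sigmoid, sigmoid_grad, x) for x in (-2.0, 0.0, 3.0)}

# A deliberately wrong "gradient" (the function value itself) fails.
wrong = check_gradient(sigmoid, sigmoid, 0.0)

print(results, wrong)
```

The central difference is used because its truncation error shrinks with the square of eps, which makes the comparison less sensitive to the exact step size than a one-sided difference.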

Gradient Computation Tips

  1. Memory Management: Gradient computation uses memory to store the computational graph. Use no_grad() or detach() when you don’t need gradients to save memory.

Common Pitfalls

  1. In-place Operations: Avoid performing in-place operations on leaf node tensors that require gradient tracking.

  2. Detaching Tensors from Computational Graph: After detaching, tensors lose their computational graph dependencies and cannot perform backward propagation for gradient calculation.

  3. Non-scalar Outputs: Remember to provide gradient arguments when calling backward() on non-scalar outputs.

  4. Memory Leaks: Long-running computations with gradient tracking can consume significant memory.
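
Pitfall 3 follows from how reverse-mode differentiation works: assuming the usual convention, the gradient argument passed for a non-scalar output acts as a weight vector v, and the result is the vector-Jacobian product of v with the Jacobian of the output. The plain-Python sketch below illustrates this for a small two-output function; f, jacobian and vjp are hypothetical helpers, not part of Riemann.

```python
def f(x):
    """Vector-valued function with two inputs and two outputs."""
    x0, x1 = x
    return [x0 * x1, x0 + x1]


def jacobian(x):
    """Analytical Jacobian, J[i][j] = d f_i / d x_j."""
    x0, x1 = x
    return [[x1, x0],
            [1.0, 1.0]]


def vjp(x, v):
    """Vector-Jacobian product: grad[j] = sum over i of v[i] * J[i][j]."""
    J = jacobian(x)
    return [sum(v[i] * J[i][j] for i in range(len(v))) for j in range(len(x))]


x = [2.0, 3.0]

# Weighting both outputs by 1 is the same as differentiating f(x)[0] + f(x)[1].
grad_sum = vjp(x, [1.0, 1.0])

# Weighting only the first output differentiates f(x)[0] alone.
grad_first = vjp(x, [1.0, 0.0])

print(grad_sum, grad_first)
```

Without a weight vector there is no single gradient to return for a vector output, which is why a gradient argument is required in that case.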

Examples

Rosenbrock Function Optimization (Banana Function)

The Rosenbrock function (also known as the banana function) is a classic non-convex test problem for optimization. Its global minimum lies at (1, 1), where the function value is 0.

Here’s an example of optimizing the Rosenbrock function using Riemann’s automatic differentiation and Adam optimizer:

import riemann as rm
from riemann import optim

# Define the Rosenbrock function (banana function)
def rosenbrock_2d(x, y):
    """Rosenbrock function for 2D case"""
    return 100. * (y - x**2.)**2. + (1. - x)**2.

# Initialize parameters with gradient tracking
x = rm.tensor(-1.2, requires_grad=True)  # Start from point (-1.2, 1.0)
y = rm.tensor(1.0, requires_grad=True)
params = [x, y]

# Setup optimizer
optimizer = optim.Adam(params, lr=0.05)

print("Optimizing Rosenbrock function (banana function):")
print(f"Initial x: {x.item():.4f}, y: {y.item():.4f}")
print(f"Initial loss: {rosenbrock_2d(x, y).item():.4f}")

# Perform optimization
for i in range(1000):
    loss = rosenbrock_2d(x, y)

    # Reset gradients
    optimizer.zero_grad()

    # Compute gradients automatically
    loss.backward()

    # Update parameters
    optimizer.step()

    # Print progress every 200 iterations
    if i % 200 == 0:
        print(f"Iteration {i}: loss = {loss.item():.8f}, x = {x.item():.8f}, y = {y.item():.8f}")

# Print final results (re-evaluate the loss at the updated parameters)
final_loss = rosenbrock_2d(x, y)
print("\nOptimization completed!")
print(f"Final x: {x.item():.10f}, y: {y.item():.10f}")
print(f"Final loss: {final_loss.item():.10f}")
print("Theoretical minimum: x=1.0, y=1.0, loss=0.0")
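
As a cross-check on what loss.backward() is expected to compute in the loop above, the gradient of the 2D Rosenbrock function can be derived by hand and compared against finite differences. The sketch below uses plain Python floats rather than Riemann tensors; rosenbrock_grad and central_diff are helper names introduced for this illustration.

```python
def rosenbrock_2d(x, y):
    """Rosenbrock function for the 2D case, on plain floats."""
    return 100.0 * (y - x**2) ** 2 + (1.0 - x) ** 2


def rosenbrock_grad(x, y):
    """Hand-derived partial derivatives of the 2D Rosenbrock function."""
    dfdx = -400.0 * x * (y - x**2) - 2.0 * (1.0 - x)
    dfdy = 200.0 * (y - x**2)
    return dfdx, dfdy


def central_diff(f, x, y, eps=1e-6):
    """Central finite difference approximation of both partial derivatives."""
    dfdx = (f(x + eps, y) - f(x - eps, y)) / (2.0 * eps)
    dfdy = (f(x, y + eps) - f(x, y - eps)) / (2.0 * eps)
    return dfdx, dfdy


# Gradient at the starting point (-1.2, 1.0): roughly (-215.6, -88.0).
start_grad = rosenbrock_grad(-1.2, 1.0)
numeric = central_diff(rosenbrock_2d, -1.2, 1.0)

# Gradient at the minimum (1, 1) vanishes.
minimum_grad = rosenbrock_grad(1.0, 1.0)

print(start_grad, numeric, minimum_grad)
```

The large gradient magnitude at the starting point compared with the flat valley floor near the minimum is one reason this function is a common stress test for optimizers.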