Automatic Differentiation Basics
================================

Riemann's automatic differentiation engine automatically records tensor computations, building a computation graph, and efficiently computes derivatives through backpropagation. This is essential for training neural networks and other optimization tasks.

**Core Concepts**

- **Computation Graph**: A directed graph automatically constructed by Riemann in the background that records the relationships between tensor operations. Each node represents a tensor, and edges represent operations.

- **Forward Pass**: The process of executing operations starting from input tensors along the computation graph to obtain the final output.

- **Backward Propagation (Backprop)**: The process of propagating gradients backward along the computation graph starting from the output tensor to compute derivatives for each input tensor.

- **Gradient**: The partial derivative of a scalar output tensor with respect to other tensors, representing the rate of change of the output relative to the input.

- **Leaf Node Tensor**: A tensor created directly by the user (e.g., through ``rm.tensor()``) with ``requires_grad=True``. These are typically model parameters.

- **Intermediate Node Tensor**: A tensor created as a result of operations on other tensors. By default, gradients for intermediate nodes are not retained.

**Gradient Computation Methods**

Riemann provides two methods for computing gradients:

1. **backward() method**: Suitable for computing gradients of multiple tensors at once. After calling, gradients for all participating leaf node tensors are computed and stored in their respective ``grad`` attributes.

2. **grad() function**: Suitable for computing gradients of specific tensors. Allows precise control over which tensors' gradients to compute, returning a tuple of gradients without modifying the tensors' ``grad`` attributes.

Gradient Tracking Switch
------------------------

By default, tensors don't track their gradients. To enable gradient tracking, set ``requires_grad=True`` when creating a tensor:

.. code-block:: python

    import riemann as rm
    
    # Tensor without gradient tracking
    x = rm.tensor([1., 2., 3.])
    print(x.requires_grad)  # False
    
    # Tensor with gradient tracking
    x = rm.tensor([1., 2., 3.], requires_grad=True)
    print(x.requires_grad)  # True

You can also enable or disable gradient tracking on existing tensors:

.. code-block:: python

    x = rm.tensor([1., 2., 3.])
    print(x.requires_grad)  # False
    
    # Enable gradient tracking
    x.requires_grad_(True)
    print(x.requires_grad)  # True

Computing Gradients
-------------------

Riemann provides two methods for computing gradients: the ``backward()`` method and the ``grad()`` function.

Using the backward() Method
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``backward()`` method is suitable for computing gradients of multiple tensors at once. After calling, gradients are automatically stored in the ``grad`` attributes of participating leaf node tensors.

**Function Signature**:

.. code-block:: python

    tensor_object.backward(gradient=None, retain_graph=False, create_graph=False)

**Parameters**:

- **gradient** (optional): When the output tensor is not a scalar, a gradient tensor with the same shape as the output is required. For scalar outputs, this parameter can be omitted, defaulting to ``None`` (equivalent to passing scalar 1).
- **retain_graph** (optional): Whether to retain the computation graph. Defaults to ``False``, meaning the graph is released after backpropagation. Set to ``True`` if you need to call ``backward()`` multiple times.
- **create_graph** (optional): Whether to record the computation graph of gradients for subsequent computation of higher-order derivatives, defaults to ``False``.

**Use Cases**:

- Training neural networks, computing gradients for all trainable parameters at once
- When multiple backward passes are needed (e.g., gradient accumulation)
- Computing higher-order derivatives

**Important Notes**:

- Only **leaf node tensors** with ``requires_grad=True`` will have their gradients computed
- **Intermediate node tensors** do not retain gradients by default; call ``retain_grad()`` if you need gradients for intermediate nodes
- Gradients accumulate in the ``grad`` attribute; manually zero gradients before multiple ``backward()`` calls

**Example 1: Gradient Computation for Scalar Output**

.. code-block:: python

    import riemann as rm
    
    # Create tensors with gradient tracking (leaf nodes)
    x = rm.tensor(2.0, requires_grad=True)
    y = rm.tensor(3.0, requires_grad=True)
    
    # Define computation (intermediate node)
    z = x * y + x ** 2.
    
    # Compute gradients
    z.backward()
    
    # Access gradients
    print(x.grad)  # dz/dx = y + 2*x = 3 + 4 = 7
    print(y.grad)  # dz/dy = x = 2

**Example 2: Gradient Computation for Non-Scalar Output**

.. code-block:: python

    import riemann as rm
    
    # Create tensors with gradient tracking
    x = rm.tensor([1., 2., 3.], requires_grad=True)
    
    # Define computation that produces a non-scalar output
    y = x * 2.
    
    # Compute gradients with respect to a vector, gradient argument required
    gradient = rm.tensor([1., 1., 1.])  # Vector for Jacobian-vector product
    y.backward(gradient)
    
    # Access gradients
    print(x.grad)  # [2., 2., 2.]

**Example 3: Retaining Gradients for Intermediate Nodes**

.. code-block:: python

    import riemann as rm
    
    x = rm.tensor(2.0, requires_grad=True)
    y = x * 3  # Intermediate node
    z = y ** 2  # Output
    
    # Retain gradients for intermediate node y
    y.retain_grad()
    
    z.backward()
    
    print(x.grad)  # dz/dx = 36
    print(y.grad)  # dz/dy = 12 (because retain_grad() was called)

Using the grad() Function
~~~~~~~~~~~~~~~~~~~~~~~~~

The ``grad()`` function is suitable for computing gradients of specific tensors, allowing precise control over which tensors' gradients to compute.

**Function Signature**:

.. code-block:: python

    riemann.autograd.grad(outputs, inputs, grad_outputs=None, retain_graph=False, create_graph=False, allow_unused=False)

**Parameters**:

- **outputs**: Output tensor(s) (scalar or tensor), the starting point for gradient computation
- **inputs**: Input tensor or tuple of tensors, specifying which tensors to compute gradients for
- **grad_outputs** (optional): Gradient tensor required when ``outputs`` is not a scalar
- **retain_graph** (optional): Whether to retain the computation graph, defaults to ``False``
- **create_graph** (optional): Whether to record the computation graph of gradients for subsequent computation of higher-order derivatives, defaults to ``False``
- **allow_unused** (optional): Whether to allow some input tensors to be unused, defaults to ``False``

**Use Cases**:

- When you only need gradients for specific tensors, not all leaf nodes
- When you don't want to modify the ``grad`` attributes of tensors
- When you need more flexible control over the gradient computation process

**Important Notes**:

- Gradients are returned as a **tuple**, in the same order as the ``inputs`` parameter
- Only tensors specified in ``inputs`` will have their gradients computed
- **Does not modify** the ``grad`` attributes of input tensors
- Intermediate nodes, even with ``retain_grad()`` called, will not automatically have gradients computed in ``grad()``; they must be explicitly specified

**Example 1: Computing Gradients for Specific Tensors**

.. code-block:: python

    import riemann as rm
    
    x = rm.tensor(2.0, requires_grad=True)
    y = rm.tensor(3.0, requires_grad=True)
    z = rm.tensor(4.0, requires_grad=True)
    
    # Define computation
    w = x * y + z
    
    # Only compute gradients for x and y, not z
    grads = rm.autograd.grad(w, (x, y))
    
    print(grads)  # (tensor(3.), tensor(2.))
    print(x.grad)  # None (grad() does not modify grad attributes)

**Example 2: Gradient Computation for Non-Scalar Output**

.. code-block:: python

    import riemann as rm
    
    x = rm.tensor([1., 2., 3.], requires_grad=True)
    y = x * 2
    
    # For non-scalar outputs, grad_outputs must be provided
    grad_outputs = rm.tensor([1., 1., 1.])
    grads = rm.autograd.grad(y, x, grad_outputs=grad_outputs)
    
    print(grads)  # (tensor([2., 2., 2.]),)

Gradient Accumulation
---------------------

Gradients are accumulated by default. This means that if you call ``backward()`` multiple times, the gradients will add up:

.. code-block:: python

    import riemann as rm
    
    # Create tensor with gradient tracking
    x = rm.tensor(1.0, requires_grad=True)
    
    # First computation
    y = x * 2.
    y.backward()
    print(x.grad)  # 2
    
    # Second computation
    y = x * 3.
    y.backward()
    print(x.grad)  # 2 + 3 = 5 (gradients accumulate)
    
    # Clear gradients
    if x.grad is not None:
        x.grad.zero_()
    print(x.grad)  # 0

Gradient Computation Context Control
------------------------------------

Riemann provides a flexible gradient computation context control mechanism through functions and context managers, allowing convenient enabling or disabling of gradient tracking. This is useful in model inference (where gradients should be disabled to save memory) and training (where gradients are needed) scenarios.

is_grad_enabled() Function
~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``is_grad_enabled()`` function checks whether gradient computation is currently enabled.

.. code-block:: python

    import riemann as rm
    
    # Check current gradient status
    print(rm.is_grad_enabled())  # True (enabled by default)
    
    with rm.no_grad():
        print(rm.is_grad_enabled())  # False

no_grad() Context Manager/Decorator
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``no_grad()`` temporarily disables gradient computation. In this context, all computations will not track gradients, which is suitable for inference phases and can significantly reduce memory usage and accelerate computation.

**As a context manager:**

.. code-block:: python

    import riemann as rm
    
    x = rm.tensor([1., 2., 3.], requires_grad=True)
    
    with rm.no_grad():
        y = x * 2.
        print(y.requires_grad)  # False

**As a function decorator:**

.. code-block:: python

    import riemann as rm
    
    @rm.no_grad
    def inference(model, x):
        # Computations within the function will not track gradients
        return model(x)

enable_grad() Context Manager/Decorator
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``enable_grad()`` temporarily enables gradient computation. Can be used to temporarily enable gradients within a ``no_grad`` context.

**As a context manager:**

.. code-block:: python

    import riemann as rm
    
    x = rm.tensor([1., 2., 3.], requires_grad=True)
    
    with rm.no_grad():
        # Gradients are disabled here
        print(rm.is_grad_enabled())  # False
        
        with rm.enable_grad():
            # Gradients are temporarily enabled here
            y = x * 2.
            print(y.requires_grad)  # True
        
        # Back to disabled state
        print(rm.is_grad_enabled())  # False

**As a function decorator:**

.. code-block:: python

    import riemann as rm
    
    @rm.enable_grad
    def train_step(model, x, target, loss_fn):
        # Computations within the function will track gradients
        pred = model(x)
        loss = loss_fn(pred, target)
        loss.backward()
        return loss

set_grad_enabled() Context Manager/Decorator
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``set_grad_enabled(mode)`` is the most flexible gradient control function, allowing explicit enabling or disabling of gradient computation.

**Parameters:**

- **mode** (bool): ``True`` to enable gradient computation, ``False`` to disable

**As a context manager:**

.. code-block:: python

    import riemann as rm
    
    x = rm.tensor([1., 2., 3.], requires_grad=True)
    
    # Disable gradients
    with rm.set_grad_enabled(False):
        y = x * 2.
        print(y.requires_grad)  # False
    
    # Enable gradients
    with rm.set_grad_enabled(True):
        y = x * 2.
        print(y.requires_grad)  # True

**As a function decorator:**

.. code-block:: python

    import riemann as rm
    
    @rm.set_grad_enabled(False)
    def inference(model, x):
        return model(x)
    
    @rm.set_grad_enabled(True)
    def train(model, x, target, loss_fn):
        pred = model(x)
        loss = loss_fn(pred, target)
        loss.backward()
        return loss

Nested Context Managers
~~~~~~~~~~~~~~~~~~~~~~~

Gradient control context managers support nested usage, where inner contexts temporarily override outer settings:

.. code-block:: python

    import riemann as rm
    
    x = rm.tensor([1., 2., 3.], requires_grad=True)
    
    with rm.no_grad():  # Outer: disable gradients
        y1 = x * 2.
        print(f"Outer no_grad: y1.requires_grad = {y1.requires_grad}")  # False
        
        with rm.enable_grad():  # Inner: enable gradients
            y2 = x * 3.
            print(f"Inner enable_grad: y2.requires_grad = {y2.requires_grad}")  # True
        
        # Back to outer context
        y3 = x * 4.
        print(f"Back to outer: y3.requires_grad = {y3.requires_grad}")  # False

Tensor Methods for Graph Detaching and Data Copying
---------------------------------------------------

Riemann provides several tensor methods for managing computation graph dependencies, and copying tensor data. Each method has distinct characteristics related to:

- Whether it creates a new tensor object or modifies in-place
- Whether it shares data with the original tensor
- Whether gradient tracking is preserved

Here are the key methods explained with individual examples:

1. **detach()**: Create a new tensor that shares data with the original but is detached from the computation graph

The detach() method returns a new tensor object that shares the same data memory as the original tensor, but is disconnected from the computation graph. This means:

- Changes to the detached tensor will modify the original tensor
- No gradients will be backpropagated through the detached tensor

.. code-block:: python

    import riemann as rm
    
    x = rm.tensor([1., 2., 3.], requires_grad=True)
    y = x * 2.
    
    # Detach y from the computation graph
    detached_y = y.detach()
    
    print(f"detached_y: {detached_y}")
    print(f"detached_y.requires_grad: {detached_y.requires_grad}")
    print(f"Modifying detached_y will modify y: {id(detached_y.data) == id(y.data)}")

**Characteristics**: Creates new tensor object, shares memory with original, disables gradient tracking

2. **detach_()**: In-place operation that detaches the current tensor from the computation graph

The detach_() method is an in-place version of detach(). Instead of creating a new tensor, it modifies the current tensor to disconnect it from the computation graph.

.. code-block:: python

    import riemann as rm
    
    x = rm.tensor([1., 2., 3.], requires_grad=True)
    y = x * 2.
    
    print(f"Before detach_(): y.requires_grad = {y.requires_grad}")
    y.detach_()  # In-place operation
    print(f"After detach_(): y.requires_grad = {y.requires_grad}")

**Characteristics**: Modifies tensor in-place (no new object), shares memory with original (same tensor), disables gradient tracking

3. **clone()**: Create a new tensor with copied data that maintains computation graph dependencies

The clone() method creates a completely new tensor object with its own data memory, but preserves the computation graph dependencies from the original tensor. This means operations on the cloned tensor can backpropagate gradients to the original tensor.

.. code-block:: python

    import riemann as rm
    
    x = rm.tensor([1., 2., 3.], requires_grad=True)
    y = x * 2.
    
    cloned_y = y.clone()
    
    print(f"cloned_y: {cloned_y}")
    print(f"cloned_y.requires_grad: {cloned_y.requires_grad}")
    print(f"Modifying cloned_y won't modify y: {id(cloned_y.data) != id(y.data)}")
    
    # Demonstrate gradient can propagate through cloned tensor to original tensor
    loss = cloned_y.sum()
    loss.backward()
    print(f"x.grad after backward(): {x.grad}")  # Gradient propagates from cloned tensor to x

**Characteristics**: Creates new tensor object, copies data (no memory sharing), preserves gradient tracking

4. **copy()**: Create a new tensor with copied data that is detached from the computation graph

The copy() method creates a new tensor object with its own data memory and is completely detached from the computation graph. This is equivalent to calling clone().detach_() and is useful for creating independent tensor copies without gradient tracking.

.. code-block:: python

    import riemann as rm
    
    x = rm.tensor([1., 2., 3.], requires_grad=True)
    y = x * 2.
    
    copied_y = y.copy()
    
    print(f"copied_y: {copied_y}")
    print(f"copied_y.requires_grad: {copied_y.requires_grad}")
    print(f"Modifying copied_y won't modify y: {id(copied_y.data) != id(y.data)}")

**Characteristics**: Creates new tensor object, copies data (no memory sharing), disables gradient tracking

5. Key Differences Between Methods

The following table summarizes the key differences between these four methods:

+----------------+----------------------+------------------------+-------------------------------+
| Method         | Creates New Object?  | Shares Memory with     | Supports Gradient Tracking?   |
|                |                      | Original Tensor?       |                               |
+================+======================+========================+===============================+
| detach()       | Yes                  | Yes                    | No                            |
+----------------+----------------------+------------------------+-------------------------------+
| detach_()      | No                   | N/A (same tensor)      | No                            |
+----------------+----------------------+------------------------+-------------------------------+
| clone()        | Yes                  | No                     | Yes                           |
+----------------+----------------------+------------------------+-------------------------------+
| copy()         | Yes                  | No                     | No                            |
+----------------+----------------------+------------------------+-------------------------------+

.. code-block:: python

    import riemann as rm
    
    x = rm.tensor([1., 2., 3.], requires_grad=True)
    
    # Using detach() - creates new tensor, shares data, detached from graph
    y1 = x.detach()
    print(f"detach() result: y1 = {y1}, requires_grad={y1.requires_grad}")
    
    # Using detach_() - in-place operation, modifies current tensor
    x2 = rm.tensor([1., 2., 3.], requires_grad=True)
    print(f"Before detach_(): x2.requires_grad={x2.requires_grad}")
    x2.detach_()
    print(f"After detach_(): x2.requires_grad={x2.requires_grad}")
    
    # Using clone() - creates new tensor, copies data, maintains graph dependency
    y2 = x.clone()
    print(f"clone() result: y2 = {y2}, requires_grad={y2.requires_grad}")
    
    # Using copy() - creates new tensor, copies data, detached from graph
    y3 = x.copy()
    print(f"copy() result: y3 = {y3}, requires_grad={y3.requires_grad}")

Key differences between these methods:

- **Data Sharing**: detach() shares data with original, while clone() and copy() create new data copies
- **In-place Operation**: detach_() modifies the tensor in-place, others create new tensors
- **Gradient Tracking**: clone() maintains gradient tracking (if original requires it), others disable gradient tracking
- **Independent Copy**: copy() creates a completely independent new tensor object that does not share data with the original tensor nor preserves computational graph dependencies

In-place Operations and Gradients
---------------------------------

In-place operations can affect gradient computation. Here are important considerations:

1. **Leaf Variables with Gradient Tracking**: In-place operations are NOT allowed on leaf tensors that require gradient tracking, as this would destroy the computational graph necessary for backpropagation.
2. **Non-Leaf Variables with Gradient Tracking**: In-place operations are allowed on non-leaf tensors (intermediate results) that require gradient tracking.

Examples:

.. code-block:: python

    import riemann as rm
    
    # 1. Example: In-place operations on leaf tensors are NOT allowed
    x = rm.tensor([1., 2., 3.], requires_grad=True)  # Leaf tensor
    
    try:
        x.add_(1.)  # This will raise an error
    except RuntimeError as e:
        print(f"Error on leaf tensor in-place operation: {e}")
    
    # 2. Example: In-place operations on non-leaf tensors ARE allowed
    y = x * 2.  # Non-leaf tensor
    print(f"Before in-place add on non-leaf tensor: y = {y}")
    y.add_(3.)  # In-place operation on non-leaf tensor
    print(f"After in-place add on non-leaf tensor: y = {y}")
    
    # Compute gradient after in-place operation on non-leaf tensor
    z = y.sum()
    z.backward()
    print(f"Gradient of x (leaf tensor): x.grad = {x.grad}")
    
    # Clear gradients
    x.grad.zero_()
    
    # 3. Example: In-place assignment using tensor indexing on non-leaf tensors
    y = x * 2.  # Non-leaf tensor
    print(f"Before in-place indexing assignment: y = {y}")
    y[0] = 100.  # In-place indexing assignment
    print(f"After in-place indexing assignment: y = {y}")
    
    # Compute gradient after indexing assignment
    z = y.sum()
    z.backward()
    print(f"Gradient of x after indexing assignment: x.grad = {x.grad}")
    
    # Clear gradients
    x.grad.zero_()
    
    # 4. Example: Gradient tracking with in-place operations
    x = rm.tensor(2.0, requires_grad=True)  # Leaf tensor
    y = rm.tensor(3.0, requires_grad=True)  # Leaf tensor
    
    a = x * y  # Intermediate tensor
    a.mul_(2.)  # In-place multiply
    b = a + x  # Final tensor
    
    b.backward()
    
    print(f"Gradient of x (left value): x.grad = {x.grad}")
    print(f"Gradient of y (right value): y.grad = {y.grad}")

Higher-Order Gradients
----------------------

Riemann supports computing higher-order derivatives by setting ``create_graph=True``:

.. code-block:: python

    import riemann as rm
    
    # Create tensor with gradient tracking
    x = rm.tensor(2.0, requires_grad=True)
    
    # First-order computation
    y = x ** 3.
    
    # Compute first-order gradients with graph creation
    dy_dx = rm.autograd.grad(y, x, create_graph=True)[0]
    print(dy_dx)  # 12
    
    # Compute second-order gradients
    d2y_dx2 = rm.autograd.grad(dy_dx, x)[0]
    print(d2y_dx2)  # 12

Additionally, Riemann provides two convenient tools for higher-order derivative computation: the ``d()`` method and ``higher_order_grad()`` function.

``d()`` Method
~~~~~~~~~~~~~~

The ``d()`` method of tensor objects is used to compute mixed partial derivatives of the current scalar tensor with respect to multiple scalar tensors. It allows for easy computation of multi-order mixed derivatives.

.. code-block:: python

    import riemann as rm
    
    # Create tensors with gradient tracking
    x = rm.tensor(2.0, requires_grad=True)
    y = rm.tensor(3.0, requires_grad=True)
    
    # Define function f = x^3 * y^2
    f = x ** 3 * y ** 2
    
    # Compute mixed partial derivative d²f/dxdy
    d2f_dxdy = f.d(x, y)
    print(d2f_dxdy)  # 72.0
    
    # Compute third-order mixed partial derivative d³f/dx²dy
    d3f_dx2dy = f.d(x, x, y)
    print(d3f_dx2dy)  # 72.0

``higher_order_grad()`` Function
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``higher_order_grad()`` function is used to compute n-th order derivatives of scalar tensor outputs with respect to input tensors. It provides a convenient way to directly compute derivatives of a specified order.

.. code-block:: python

    import riemann as rm
    
    # Create tensor with gradient tracking
    x = rm.tensor(2.0, requires_grad=True)
    
    # Define function y = x^3
    y = x ** 3
    
    # Compute second-order derivative
    d2y_dx2 = rm.autograd.higher_order_grad(y, x, 2)[0]
    print(d2y_dx2)  # 12.0
    
    # Compute third-order derivative
    d3y_dx3 = rm.autograd.higher_order_grad(y, x, 3)[0]
    print(d3y_dx3)  # 6.0
    
    # Multiple inputs case
    x1 = rm.tensor(1.0, requires_grad=True)
    x2 = rm.tensor(2.0, requires_grad=True)
    z = x1 ** 2 + x2 ** 3
    grads = rm.autograd.higher_order_grad(z, [x1, x2], 2)
    print(grads)  # (2.0, 12.0)

Gradient functions (Functional API)
-----------------------------------

Riemann also provides a set of functional API functions in ``riemann.autograd.functional`` module for computing more advanced derivative structures, such as Jacobian matrices, Hessian matrices, Jacobian-vector products, etc.

``jacobian()`` Function
~~~~~~~~~~~~~~~~~~~~~~~~

The ``jacobian()`` function computes the Jacobian matrix of a function from input to output, showing all first-order partial derivatives of the function output with respect to the input.

.. code-block:: python

    import riemann as rm
    
    # Define function f = x^2
    def f(x):
        return x ** 2
    
    # Create input tensor
    x = rm.tensor([1.0, 2.0, 3.0], requires_grad=True)
    
    # Compute Jacobian matrix
    jac = rm.autograd.functional.jacobian(f, x)
    print(jac)
    print(jac.shape)  # (3, 3)  # For vector input, shape is (n_inputs, n_outputs)

``hessian()`` Function
~~~~~~~~~~~~~~~~~~~~~~

The ``hessian()`` function computes the Hessian matrix of a scalar-valued function, showing all second-order partial derivatives of the function with respect to its inputs.

.. code-block:: python

    import riemann as rm
    
    # Define function f = x^3
    def f(x):
        return x ** 3
    
    # Create input tensor
    x = rm.tensor(2.0, requires_grad=True)
    
    # Compute Hessian matrix
    hess = rm.autograd.functional.hessian(f, x)
    print(hess)
    print(hess.shape)  # (1, 1)  # For scalar input, shape is (input_size, input_size)

``derivative()`` Function
~~~~~~~~~~~~~~~~~~~~~~~~~

The ``derivative()`` function computes a derivative function for the given function. It creates a new function that, when called, returns the derivative of the original function at the specified inputs.

.. code-block:: python

    import riemann as rm
    
    # Define function f = x^2
    def f(x):
        return x ** 2.
    
    # Create derivative function
    df = rm.autograd.functional.derivative(f)
    
    # Test the derivative function
    x = rm.tensor(2.0, requires_grad=True)
    print(df(x))  # Should return tensor(4.0)
    
    # Multi-input example
    def g(x, y):
        return x * y + x ** 2.
    
    dg = rm.autograd.functional.derivative(g)
    x = rm.tensor(2.0, requires_grad=True)
    y = rm.tensor(3.0, requires_grad=True)
    print(dg(x, y))

``jvp()`` (Jacobian-Vector Product) Function
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``jvp()`` function computes the product of a Jacobian matrix with a given vector.

.. code-block:: python

    import riemann as rm
    
    # Define function f = x^2
    def f(x):
        return x ** 2
    
    # Create input tensor
    x = rm.tensor([1.0, 2.0, 3.0], requires_grad=True)
    
    # Define v vector
    v = rm.tensor([1.0, 1.0, 1.0])
    
    # Compute jvp
    f_x, jvp_val = rm.autograd.functional.jvp(f, x, v)
    print(jvp_val)

``vjp()`` (Vector-Jacobian Product) Function
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``vjp()`` function computes the product of a given vector with a Jacobian matrix.

.. code-block:: python

    import riemann as rm
    
    # Define function f = x^2
    def f(x):
        return x ** 2
    
    # Create input tensor
    x = rm.tensor([1.0, 2.0, 3.0], requires_grad=True)
    
    # Define v vector
    v = rm.tensor([1.0, 1.0, 1.0])
    
    # Compute vjp
    f_x, vjp_val = rm.autograd.functional.vjp(f, x, v)
    print(vjp_val)

``hvp()`` (Hessian-Vector Product) and ``vhp()`` Functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``hvp()`` and ``vhp()`` functions compute Hessian-Vector Product and Vector-Hessian Product respectively. Since the Hessian matrix is symmetric, ``hvp()`` and ``vhp()`` are effectively the same.

.. code-block:: python

    import riemann as rm
    
    # Define scalar-valued function
    def f(x):
        return (x ** 3).sum()
    
    # Create input tensor
    x = rm.tensor([1.0, 2.0, 3.0], requires_grad=True)
    
    # Define v vector
    v = rm.tensor([1.0, 1.0, 1.0])
    
    # Compute hvp
    f_x, hvp_val = rm.autograd.functional.hvp(f, x, v)
    print(hvp_val)

    # vhp computes the same result as hvp
    f_x, vhp_val = rm.autograd.functional.vhp(f, x, v)
    print(vhp_val)

Custom Gradient Functions
-------------------------

Riemann provides three ways to implement custom functions with gradient tracking support:

1. **Using Riemann Tensor Functions (Automatic Gradients)**
   If you implement your custom function using existing Riemann tensor functions, you get gradient tracking automatically without writing any gradient code:
   
   .. code-block:: python

       import riemann as rm
       
       def my_custom_function(x):
           """A custom function that automatically gets gradient support"""
           return rm.exp(rm.sin(x)) + x**2.
       
       # Test automatic gradient tracking
       x = rm.tensor(1.0, requires_grad=True)
       y = my_custom_function(x)
       y.backward()
       print(f"Gradient: {x.grad}")  # Will automatically compute correct gradient

2. **Using track_grad Decorator**
   Use the ``track_grad`` decorator to wrap your function and provide explicit gradient computation.

   **Gradient Function Interface Requirements:**

   The gradient function passed to ``track_grad`` must follow these interface requirements:

   - **Parameters**: Must accept the same parameters as the forward function (same names and order)
   - **Return Value**: Must return a ``tuple`` containing the gradient (partial derivative) for each input tensor
   - **Tuple Elements**: Each element corresponds to the gradient of the respective input tensor. For tensors that don't require gradients, return ``None`` for that position
   - **Gradient Calculation**: The gradient should be computed as the partial derivative of the output with respect to each input

   **Example for single input:**

   .. code-block:: python

       import riemann as rm
       import numpy as np
       
       def sigmoid_derivative(x):
           """Gradient function for sigmoid: returns tuple with one element"""
           sig = 1. / (1. + np.exp(-x.data))
           return (rm.tensor(sig * (1. - sig)),)  # Note: must return a tuple
       
       @rm.track_grad(sigmoid_derivative)
       def custom_sigmoid(x):
           """Custom sigmoid function with gradient support"""
           return rm.tensor(1. / (1. + np.exp(-x.data)))
       
       # Test custom sigmoid with gradient
       x = rm.tensor(0.0, requires_grad=True)
       y = custom_sigmoid(x)
       y.backward()
       print(f"Sigmoid output: {y}")  # Should be 0.5
       print(f"Sigmoid gradient: {x.grad}")  # Should be 0.25

   **Example for multiple inputs:**

   .. code-block:: python

       import riemann as rm
       
       def multiply_derivative(x, y):
           """Gradient function for multiplication: d(xy)/dx = y, d(xy)/dy = x"""
           return (y, x)  # Returns tuple with gradient for each input
       
       @rm.track_grad(multiply_derivative)
       def custom_multiply(x, y):
           """Custom multiplication function with gradient support"""
           return x * y
       
       # Test with multiple inputs
       x = rm.tensor(2.0, requires_grad=True)
       y = rm.tensor(3.0, requires_grad=True)
       z = custom_multiply(x, y)
       z.backward()
       print(f"z = {z}")  # Should be 6.0
       print(f"dz/dx = {x.grad}")  # Should be 3.0 (y)
       print(f"dz/dy = {y.grad}")  # Should be 2.0 (x)

3. **Using Function Class**
   For more complex cases, you can subclass ``Function`` and implement both ``forward`` and ``backward`` static methods.

   **Function Class Interface:**

   To create a custom function using the ``Function`` class, you must implement two static methods:

   **forward(ctx, *inputs)**

   - **Purpose**: Performs the forward computation
   - **Parameters**:

     - ``ctx``: Context object used to save information for the backward pass. Use ``ctx.save_for_backward()`` to store tensors needed in backward
     - ``*inputs``: Input tensors (variable number of arguments)
   - **Returns**: Output tensor(s) of the forward computation
   - **Usage**: Implement your custom computation logic here and save any tensors needed for gradient computation using ``ctx.save_for_backward()``

   **backward(ctx, grad_output)**

   - **Purpose**: Performs the backward (gradient) computation
   - **Parameters**:

     - ``ctx``: Context object containing information saved during forward pass. Access saved tensors via ``ctx.saved_tensors``
     - ``grad_output``: Gradient of the output tensor (from subsequent layers in the computation graph)
   - **Returns**: Tuple of gradients, one for each input tensor. Each gradient should be the product of ``grad_output`` and the local gradient (partial derivative)
   - **Usage**: Compute gradients using the chain rule: ``grad_input = grad_output * local_gradient``

   **Example:**

   .. code-block:: python

       import riemann as rm
       import numpy as np
       
       class CustomSigmoid(rm.autograd.Function):
           @staticmethod
           def forward(ctx, x):
               """Forward computation for sigmoid
               
               Args:
                   ctx: Context object for saving tensors
                   x: Input tensor
               
               Returns:
                   Output tensor after applying sigmoid
               """
               sig = 1. / (1. + np.exp(-x.data))
               ctx.save_for_backward(rm.tensor(sig))  # Save for backward
               return rm.tensor(sig)
           
           @staticmethod
           def backward(ctx, grad_output):
               """Backward computation for sigmoid
               
               Args:
                   ctx: Context object with saved tensors
                   grad_output: Gradient from output side
               
               Returns:
                   Gradient with respect to input
               """
               sig, = ctx.saved_tensors  # Retrieve saved tensor
               # Chain rule: grad_input = grad_output * local_gradient
               # local_gradient for sigmoid: sig * (1 - sig)
               return grad_output * sig * (1. - sig)
       
       # Test CustomSigmoid
       x = rm.tensor(0.0, requires_grad=True)
       y = CustomSigmoid.apply(x)  # Use apply() to call the function
       y.backward()
       print(f"Sigmoid output: {y}")  # Should be 0.5
       print(f"Sigmoid gradient: {x.grad}")  # Should be 0.25

   **Key Points:**

   - Always use ``@staticmethod`` decorator for both ``forward`` and ``backward`` methods
   - Use ``ctx.save_for_backward()`` in ``forward`` to save tensors needed for gradient computation
   - Access saved tensors in ``backward`` via ``ctx.saved_tensors`` (returns a tuple)
   - The ``backward`` method must return a tuple with one gradient for each input to ``forward``
   - Call the function using ``ClassName.apply(*inputs)``, not by instantiating the class

Advanced Computational Graph Manipulation
-----------------------------------------

Riemann provides functions for manually manipulating the computational graph. These functions are designed for special use cases where you need to connect tensors to the computational graph without affecting forward computation values or backward gradient values. These are low-level tools typically used in framework internals (such as Riemann's hook handling mechanism) rather than common user scenarios.

share_grad_map Function
~~~~~~~~~~~~~~~~~~~~~~~

The ``share_grad_map`` function fully connects and maps a group of tensors to another group of tensors with the same count. The corresponding positions between the two groups have an identity mapping: forward pass transparently transmits tensor values, backward pass transparently transmits gradient values (equivalent to a clone relationship). For other connections, forward pass transmits zero values (i.e., does not affect the new tensor values), backward pass transmits zero gradients.

**Purpose:**
Ensure all tensors in a group participate in the computational graph and receive gradients (zero for tensors not directly involved in computation) rather than None. This is particularly useful when you want certain tensors to receive zero gradients without changing the existing computational graph's forward or backward computation values.

**Core Mechanism:**

1. For each tensor that requires gradients, create a clone. The cloned tensor depends on the original tensor through the ``clone`` operation (gradient passes through)
2. Attach all other tensors (excluding itself) as zero-gradient sources to the cloned tensor
3. This way, each tensor maintains its gradient relationship with the original tensor while forming zero-gradient connections with other tensors

**Parameters:**

- ``tensors``: A tuple or list of tensors to connect. Must be tuple or list (not set) to preserve order.

**Returns:**
A tuple or list of tensors with the same values but connected to a shared computational graph. Note: tensors with ``requires_grad=True`` are cloned (not modified in place).

**Behavior:**

- Tensors with ``requires_grad=True`` are cloned, and all other tensors are attached as zero-gradient sources to the cloned tensor
- Tensors without gradients or non-TN objects remain unchanged
- All connected tensors receive zero gradients from each other

**Example:**

.. code-block:: python

    import riemann as rm
    
    a = rm.tensor([1.0, 2.0], requires_grad=True)
    b = rm.tensor([3.0, 4.0], requires_grad=True)
    c = rm.tensor([5.0, 6.0], requires_grad=True)
    
    # Define a function that only uses a and b
    def func(a, b, c):
        return (a * b).sum()
    
    # Before share_grad_map: c doesn't participate, receives None
    y1 = func(a, b, c)
    y1.backward()
    print(f"c.grad = {c.grad}")  # Output: None
    
    # Reset tensors
    a = rm.tensor([1.0, 2.0], requires_grad=True)
    b = rm.tensor([3.0, 4.0], requires_grad=True)
    c = rm.tensor([5.0, 6.0], requires_grad=True)
    
    # After share_grad_map: all tensors connected, c receives zero gradient
    a_new, b_new, c_new = rm.share_grad_map((a, b, c))
    y2 = func(a_new, b_new, c_new)
    y2.backward()
    print(f"c.grad = {c_new.grad}")  # Output: [0.0, 0.0]
    
    # Verify: forward values are identical, a and b gradients unchanged
    assert float(y1.data) == float(y2.data)
    assert (a_new.grad == rm.tensor([3., 4.])).all()
    assert (b_new.grad == rm.tensor([1., 2.])).all()

**Typical Use Cases:**

1. **Module Hook Handling**: In Riemann's module hook mechanism, ``share_grad_map`` is used to create new module output tensors to replace the original output tensors. When only some tensors of a module participate in or contribute to the loss function computation, ``share_grad_map`` produces a new module output without changing forward computation or backward gradient values. This ensures that output tensors that previously didn't participate in loss computation can now receive zero gradients, and input tensors depending on these outputs will also receive zero gradients.

2. **Multi-task Learning**: When some parameters don't participate in certain task's loss computation but you want them to receive zero gradients rather than None for gradient accumulation.

3. **Conditional Computation**: When some tensors are conditionally used in forward pass but you want consistent gradient behavior regardless of the condition.

4. **Gradient Monitoring**: When you want to monitor gradients of all parameters in a group, even those not directly involved in a specific computation.

**Note:** This is a low-level function for manually building computational graphs. Most users should rely on Riemann's automatic graph construction rather than using this function directly.

Supporting Functions and Methods
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following functions and methods are used internally by ``share_grad_map`` and are rarely needed directly by users:

**fwbw_all_zero Function**

Returns a scalar tensor with value 0.0 in forward pass and returns a zero tensor with the same shape as input in backward pass. Used to add a tensor to the computational graph without affecting forward or backward values.

**attach_zero_grad_sources Method**

Attaches multiple tensors as source tensors to a tensor. This doesn't change the tensor's value, but allows it to pass zero gradients to these sources during backward pass. Used internally to connect tensors to the computational graph so they receive zero gradients instead of None.

Gradient Checking
-----------------

Use the ``gradcheck`` function to verify your custom gradient functions are correct:

.. code-block:: python

    import riemann as rm
    
    # Define a test function for gradcheck
    def test_function(x):
        return CustomSigmoid.apply(x)
    
    # Perform gradient check
    x = rm.tensor(0.0, requires_grad=True)
    check_passed = rm.gradcheck(test_function, (x,))
    print(f"Gradient check passed: {check_passed}")

Gradcheck verifies that your analytical gradient computation matches the numerical gradient computed using finite difference method.

Gradient Computation Tips
-------------------------

1. **Memory Management**: Gradient computation uses memory to store the computational graph. Use ``no_grad()`` or ``detach()`` when you don't need gradients to save memory.

Common Pitfalls
------------------

1. **In-place Operations**: Avoid performing in-place operations on leaf node tensors that require gradient tracking.

2. **Detaching Tensors from Computational Graph**: After detaching, tensors lose their computational graph dependencies and cannot perform backward propagation for gradient calculation.

3. **Non-scalar Outputs**: Remember to provide gradient arguments when calling ``backward()`` on non-scalar outputs.

4. **Memory Leaks**: Long-running computations with gradient tracking can consume significant memory.

Examples
--------

Rosenbrock Function Optimization (Banana Function)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Rosenbrock function (also known as banana function) is a classic non-convex optimization problem. The function has its minimum at (1, 1) with value 0.

Here's an example of optimizing the Rosenbrock function using Riemann's automatic differentiation and Adam optimizer:

.. code-block:: python

    import riemann as rm
    from riemann import optim

    # Define the Rosenbrock function (banana function)
    def rosenbrock_2d(x, y):
        """Rosenbrock function for 2D case"""
        return 100. * (y - x**2.)**2. + (1. - x)**2.

    # Initialize parameters with gradient tracking
    x = rm.tensor(-1.2, requires_grad=True)  # Start from point (-1.2, 1.0)
    y = rm.tensor(1.0, requires_grad=True)
    params = [x, y]

    # Setup optimizer
    optimizer = optim.Adam(params, lr=0.05)

    print("Optimizing Rosenbrock function (banana function):")
    print(f"Initial x: {x.item():.4f}, y: {y.item():.4f}")
    print(f"Initial loss: {rosenbrock_2d(x, y).item():.4f}")

    # Perform optimization
    for i in range(1000):
        loss = rosenbrock_2d(x, y)
        
        # Reset gradients
        optimizer.zero_grad()
        
        # Compute gradients automatically
        loss.backward()
        
        # Update parameters
        optimizer.step()
        
        # Print progress every 200 iterations
        if i % 200 == 0:
            print(f"Iteration {i}: loss = {loss.item():.8f}, x = {x.item():.8f}, y = {y.item():.8f}")

    # Print final results
    print(f"\nOptimization completed!")
    print(f"Final x: {x.item():.10f}, y: {y.item():.10f}")
    print(f"Final loss: {loss.item():.10f}")
    print(f"Theoretical minimum: x=1.0, y=1.0, loss=0.0")