18 Hands-On: Getting Started with Gnu-RL
18.1 Overview
The code in this notebook requires the Gnu-RL repository and its dependencies. It is not executed during the book build. Follow along by running each cell in your own Python environment.
Important: The original Gnu-RL code requires EnergyPlus (a building energy simulation engine) for its full training pipeline. We will not use EnergyPlus in this tutorial. Instead, we work with:
- The Differentiable MPC module (`diff_mpc/`) — pure PyTorch, no external dependencies
- A simplified state-space environment (`ssM_env/`) — a lightweight linear system simulator
- Pre-computed results from the original authors' experiments
The original codebase was written for PyTorch 1.0 (2018). A pytorch2-compat branch with all necessary patches is provided so you don’t need to modify anything manually.
18.2 Environment Setup
18.2.1 Cloning the Repository
Clone the pytorch2-compat branch, which contains the original code with compatibility patches for modern PyTorch (\(\geq\) 2.0):
git clone -b pytorch2-compat https://github.com/INFERLab/Gnu-RL.git gnu-rl
cd gnu-rl

Why a `pytorch2-compat` branch?
The diff_mpc/ module required four categories of updates to work with PyTorch 2.x:
1. Autograd Function API (`lqr_step.py`): PyTorch 2.x requires `torch.autograd.Function` subclasses to use `@staticmethod` for `forward` and `backward`. The old instance-method style was removed. The patched code splits `LQRStep` into a wrapper class (holds config) and `_LQRStepFunction` (static autograd methods that receive config via `ctx`).
2. Removed linear algebra ops (`pnqp.py`, `lqr_step.py`): `torch.btrifact()` and `torch.btrisolve()` were removed in favor of `torch.linalg.lu_factor()` and `torch.linalg.lu_solve()`.
3. Boolean tensor handling (`util.py`, `pnqp.py`, `lqr_step.py`): `.byte()` masks replaced with proper `torch.bool` tensors; arithmetic negation `1 - mask` replaced with bitwise `~mask`.
4. Variable wrapping (`mpc.py`): `torch.autograd.Variable` has been a no-op wrapper since PyTorch 0.4 and is now fully removed. All `Variable(...)` calls replaced with direct tensor operations.
None of these changes affect the mathematical behavior of the solver — they are purely API migrations.
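For reference, the new-style API looks like the following minimal sketch — a generic toy `Function` illustrating the `@staticmethod` pattern, not the actual `LQRStep` code:

```python
import torch

class _ScaleFunction(torch.autograd.Function):
    """PyTorch 2.x style: forward/backward are @staticmethod, and any per-call
    configuration travels through ctx instead of living on an instance."""

    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale              # stash config for the backward pass
        return x * scale

    @staticmethod
    def backward(ctx, grad_output):
        # one gradient per forward input; non-tensor inputs return None
        return grad_output * ctx.scale, None

x = torch.randn(3, requires_grad=True)
y = _ScaleFunction.apply(x, 2.0)
y.sum().backward()
print(x.grad)  # tensor([2., 2., 2.])
```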
18.2.2 Installing Dependencies
Initialize a uv project and install the required packages:
uv init --python 3.10
uv add torch numpy pandas matplotlib requests

18.2.3 Verifying the Setup
A quick sanity check — import the MPC solver and run a trivial optimization:
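```python
import torch
from diff_mpc.mpc import MPC, QuadCost, LinDx

# 1D system: x_{t+1} = 0.9*x + 0.1*u, drive state toward 24
n_state, n_ctrl, T = 1, 1, 5

F = torch.zeros(T-1, 1, n_state, n_state + n_ctrl)
F[:, :, 0, 0] = 0.9   # state dynamics
F[:, :, 0, 1] = 0.1   # control influence
f = torch.zeros(T-1, 1, n_state)

C = torch.zeros(T, 1, n_state + n_ctrl, n_state + n_ctrl)
C[:, :, 0, 0] = 1.0    # penalize state deviation
C[:, :, 1, 1] = 0.01   # penalize control effort
c = torch.zeros(T, 1, n_state + n_ctrl)
c[:, :, 0] = -24.0     # target state = 24

x_init = torch.tensor([[20.0]])

solver = MPC(n_state, n_ctrl, T,
             u_lower=0.0, u_upper=50.0,
             lqr_iter=20, verbose=0,
             exit_unconverged=False)
x, u, costs = solver(x_init, QuadCost(C, c), LinDx(F, f))

print(f"Initial state: {x_init[0, 0]:.1f}")
print(f"Planned states: {x[:, 0, 0].detach().numpy()}")
print(f"Planned controls: {u[:, 0, 0].detach().numpy()}")
```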
You should see the MPC driving the state from 20 toward 24 by applying large control actions early, then tapering off.
18.3 Repository Walkthrough
The repository has three main parts: the environment-agnostic MPC solver (diff_mpc/), the environment-specific training scripts (agent/), and a lightweight test environment (ssM_env/). For Assignment 3, you will reuse diff_mpc/ as-is and adapt the training logic from agent/ to work with SustainDC instead of EnergyPlus.
18.3.1 Key Files
gnu-rl/
├── diff_mpc/ # Differentiable MPC solver (pure PyTorch) ← KEEP AS-IS
│ ├── mpc.py # MPC class: orchestrates iLQR iterations
│ ├── lqr_step.py # Single LQR backward/forward sweep (autograd Function)
│ ├── pnqp.py # Box-constrained QP solver for control bounds
│ ├── dynamics.py # Dynamics model wrappers (LinDx, etc.)
│ └── util.py # Batch matrix utilities (bmv, bger, bdiag)
│
├── ssM_env/ # Simplified state-space environment (no EnergyPlus)
│ └── env.py # Linear system: x_{t+1} = Ax + Bu*u + Bd*d
│
├── agent/ # Training scripts ← ADAPT FOR SUSTAINDC
│ ├── simulate.py # Step 0: Run baseline controller, collect data
│ ├── Imit_EP.py # Step 1: Offline imitation learning (Learner class)
│ ├── PPO_MPC_EP.py # Step 2: Online PPO refinement
│ ├── utils.py # Reward function, data utilities
│ └── results/ # Pre-computed results from the authors
│
├── eplus_env/ # EnergyPlus Gym wrapper (not used)
└── Demo.ipynb # Results visualization notebook
18.3.2 The Two-Phase Pipeline in Code
The two training phases from the paper map directly to files:
| Phase | File | What it does |
|---|---|---|
| 0. Data collection | `agent/simulate.py` | Runs the existing controller in EnergyPlus, logs \((x, u, d)\) trajectories |
| 1. Imitation learning | `agent/Imit_EP.py` | `Learner` class with learnable \(F\), \(B_d\); trains by backpropagating \(\mathcal{L}_{imit}\) through the MPC solver |
| 2. Online PPO | `agent/PPO_MPC_EP.py` | Wraps MPC output in a Gaussian policy; updates \(F\), \(B_d\) via PPO to maximize reward |
The diff_mpc/ module is called by both phases but is never modified during training — only the dynamics parameters (\(F\), \(B_d\)) that are passed into it are updated.
18.4 Understanding the Pre-Computed Results
The authors provide pre-computed results in agent/results/ so you can visualize the full training pipeline without running EnergyPlus. The code below is adapted from Demo.ipynb. Run these cells to reproduce the four key plots from the paper.
18.4.1 Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
filePath = "agent/results/"
lam = 10 # weight on state loss in the imitation objective

18.4.2 Imitation Learning Loss Curves
This plot shows how \(\mathcal{L}_{state}\), \(\mathcal{L}_{action}\), and the combined \(\mathcal{L}_{imit} = \lambda \mathcal{L}_{state} + \mathcal{L}_{action}\) evolve over 20 epochs. Look for the train/validation gap — a large gap would indicate overfitting of the dynamics parameters.
imit_loss = pd.read_pickle(filePath + "Imit_loss_rl.pkl")
fig, axes = plt.subplots(1, 3, figsize=(16, 4))
for ax, title, train_key, val_key in [
(axes[0], r"$\mathcal{L}_{state}$", "train_state_loss", "val_state_loss"),
(axes[1], r"$\mathcal{L}_{action}$", "train_action_loss", "val_action_loss"),
]:
ax.plot(imit_loss[train_key], label="train")
ax.plot(imit_loss[val_key], label="val")
ax.set_title(title)
ax.set_xlabel("Epoch")
ax.set_xlim((0, 19))
ax.legend()
# Combined loss
axes[2].plot(lam * imit_loss["train_state_loss"] + imit_loss["train_action_loss"], label="train")
axes[2].plot(lam * imit_loss["val_state_loss"] + imit_loss["val_action_loss"], label="val")
axes[2].set_title(r"$\mathcal{L}_{imit}$")
axes[2].set_xlabel("Epoch")
axes[2].set_xlim((0, 19))
axes[2].legend()
plt.tight_layout()
plt.show()

18.4.3 Agent Behavior After Pretraining
After imitation learning, the agent should track the expert controller’s states and actions closely. Each epoch’s results are saved as Imit_rl_{epoch}.pkl — try different epochs to see how tracking improves during training.
epoch = 19 # best epoch (last one); try 0 vs. 19 to see improvement
imit_record = pd.read_pickle(filePath + f"Imit_rl_{epoch}.pkl")
# Setpoint and occupancy live in the simulation dataset, not the imitation record
sim_data = pd.read_pickle(filePath + "Sim-TMY2.pkl").loc[imit_record.index]
# Reset to integer index so set_xlim works with integer bounds
imit_record = imit_record.reset_index(drop=True)
sim_data = sim_data.reset_index(drop=True)
start_time, end_time = 0, 500 # adjust to zoom in/out
fig, axes = plt.subplots(2, 1, figsize=(20, 5), sharex=True)
axes[0].plot(imit_record["Expert nState"], label="Expert")
axes[0].plot(imit_record["Learner nState"], label="Learner")
axes[0].plot(sim_data["Indoor Temp. Setpoint"], "k--", label="Setpoint")
axes[0].set_ylabel("Next State (Temperature)")
axes[0].set_xlim((start_time, end_time))
axes[0].legend()
axes[1].plot(imit_record["Expert action"], label="Expert")
axes[1].plot(imit_record["Learner action"], label="Learner")
axes[1].plot(sim_data["Occupancy Flag"] * 5, "k--", label="Occupancy")
axes[1].set_ylabel("Action (Supply Temp.)")
axes[1].set_xlim((start_time, end_time))
axes[1].legend()
plt.tight_layout()
plt.show()

18.4.4 Online Learning Performance
After PPO refinement, compare Gnu-RL against the EnergyPlus baseline controller. Notice how Gnu-RL learns to pre-heat before occupancy periods — the agent anticipates the setpoint change rather than reacting to it.
baseline = pd.read_pickle(filePath + "Sim-TMY3.pkl").reset_index(drop=True)
rl = pd.read_pickle(filePath + "perf_rl_obs.pkl").reset_index(drop=True)
start_time, end_time = 2000, 2500 # a representative week
fig, axes = plt.subplots(2, 1, figsize=(20, 6), sharex=True)
axes[0].plot(baseline["Indoor Temp."], "b-", label="Baseline")
axes[0].plot(rl["Indoor Temp."], "r-", label="Gnu-RL")
axes[0].plot(rl["Indoor Temp. Setpoint"], "k--", label="Setpoint")
axes[0].set_ylabel("Indoor Temperature")
axes[0].set_xlim([start_time, end_time])
axes[0].legend()
axes[1].plot(baseline["Sys Out Temp."], "b", label="Baseline")
axes[1].plot(rl["Sys Out Temp."], "r", label="Gnu-RL")
axes[1].plot(baseline["Occupancy Flag"] * 30, "k--", label="Occupancy")
axes[1].set_ylabel("Supply Air Temp. (Action)")
axes[1].set_xlim([start_time, end_time])
axes[1].legend()
plt.tight_layout()
plt.show()

18.4.5 Residue Reward Over Time
The residue reward is the difference between Gnu-RL’s reward and the baseline’s reward at each timestep. A positive trend confirms that PPO is improving the policy over the 90-day online learning period.
rl_perf = np.load(filePath + "perf_rl.npy", allow_pickle=True)
baseline_perf = pd.read_pickle(filePath + "Sim-TMY3.pkl")
# Compute residue reward (moving average with confidence band)
residue = pd.Series(rl_perf[:, 0]) # reward column (one value per episode/day)
RewardStats = residue.rolling(window=7).agg(["mean", "std"]).dropna() # 7-day moving average
RewardStats.columns = ["Mean", "Std"]
fig = plt.figure(figsize=(10, 4))
plt.plot(RewardStats["Mean"])
plt.fill_between(
RewardStats.index,
RewardStats["Mean"] - 1.645 * RewardStats["Std"],
RewardStats["Mean"] + 1.645 * RewardStats["Std"],
alpha=0.2,
)
plt.axhline(y=0, color="k", linestyle="--")
plt.ylabel("Residue Reward")
plt.xlabel("Episode (day)")
plt.tight_layout()
plt.show()

18.5 Code Deep-Dive: The Differentiable MPC Policy
You do not need to modify anything inside diff_mpc/. What you do need is a clear understanding of the interface — what you pass in and what you get back — so you can wire it into your own training loop for SustainDC.
18.5.1 The MPC Interface
The solver is called as:
x, u, _ = mpc.MPC(n_state, n_ctrl, T, **options)(x_init, QuadCost(C, c), LinDx(F, f))

The “Shape” column below gives the PyTorch tensor dimensions. These are 3D or 4D tensors because the solver is vectorized over a batch dimension (\(B\)) and a time dimension (\(T\)). Let \(n = n_{state}\), \(m = n_{ctrl}\), \(B = n_{batch}\):
| Argument | Shape | Dimensions | Description |
|---|---|---|---|
| `x_init` | `[n_batch, n_state]` | \(\in \mathbb{R}^{B \times n}\) | Current state (e.g., zone temperatures) |
| `C` | `[T, n_batch, n_state+n_ctrl, n_state+n_ctrl]` | \(\in \mathbb{R}^{T \times B \times (n+m) \times (n+m)}\) | Quadratic cost matrix (diagonal: weights on states and controls) |
| `c` | `[T, n_batch, n_state+n_ctrl]` | \(\in \mathbb{R}^{T \times B \times (n+m)}\) | Linear cost vector (encodes target setpoints) |
| `F` | `[T-1, n_batch, n_state, n_state+n_ctrl]` | \(\in \mathbb{R}^{(T-1) \times B \times n \times (n+m)}\) | Dynamics matrix: \(x_{t+1} = F_t [x_t; u_t] + f_t\) |
| `f` | `[T-1, n_batch, n_state]` | \(\in \mathbb{R}^{(T-1) \times B \times n}\) | Affine dynamics term (disturbance contribution: \(B_d d_t\)) |

| Return | Shape | Dimensions | Description |
|---|---|---|---|
| `x` | `[T, n_batch, n_state]` | \(\in \mathbb{R}^{T \times B \times n}\) | Optimal state trajectory |
| `u` | `[T, n_batch, n_ctrl]` | \(\in \mathbb{R}^{T \times B \times m}\) | Optimal control trajectory (apply `u[0]` only — receding horizon) |
Key solver options: u_lower/u_upper (control bounds), lqr_iter (number of iLQR iterations, typically 20), exit_unconverged=False (don’t error if solver doesn’t fully converge).
18.5.2 How \(F\) and \(f\) Encode the Dynamics
The dynamics \(x_{t+1} = Ax_t + B_u u_t + B_d d_t\) are split into two parts for the solver:
- \(F\): The concatenation \([A \mid B_u]\), shape `[n_state, n_state + n_ctrl]`. This is learnable.
- \(f\): The disturbance term \(B_d d_t\), pre-computed for the planning horizon. \(B_d\) is learnable; \(d_t\) comes from weather/workload forecasts.
In code (from Imit_EP.py):
# F_hat is [n_state, n_state + n_ctrl] --- the learnable dynamics
F_hat = torch.ones((n_state, n_state + n_ctrl)).double().requires_grad_()
# Bd_hat is [n_state, n_dist] --- the learnable disturbance matrix
Bd_hat = torch.tensor(np.random.rand(n_state, n_dist)).requires_grad_()
# f is computed by multiplying Bd_hat with the disturbance forecast
dt = ... # [n_dist, T-1] disturbance values over the horizon
ft = torch.mm(Bd_hat, dt).transpose(0, 1).unsqueeze(1) # [T-1, 1, n_state]

For Assignment 3, you will define what \(x\), \(u\), and \(d\) mean for SustainDC’s cooling environment, then initialize F_hat and Bd_hat with appropriate shapes.
18.5.3 How \(C\) and \(c\) Encode the Cost Function
The MPC minimizes \(\sum_t \frac{1}{2} \tau_t^T C_t \tau_t + c_t^T \tau_t\) where \(\tau_t = [x_t; u_t]\). In Gnu-RL, the cost balances comfort against energy:
# Diagonal of C: [eta, eta, ..., 0.001, 0.001, ...]
# ^-- n_state --^ ^-- n_ctrl --^
# eta depends on occupancy: high when occupied, low when not
diag[:, :n_state] = eta_w_flag # comfort weight (varies by timestep)
diag[:, n_state:] = 0.001 # energy weight (constant)
# Linear cost c encodes the setpoint target
c[:, :n_state] = -eta * x_target # drives state toward setpoint
c[:, n_state:] = 1 # L1 penalty on control effort

For Assignment 3, you will design a cost function appropriate for data center cooling — balancing temperature constraint satisfaction against energy consumption (and potentially carbon intensity).
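To see why the linear term encodes the setpoint, here is a quick numeric check (a stand-alone sketch, independent of the Gnu-RL code): minimizing \(\frac{1}{2}\eta x^2 - \eta\, x_{target}\, x\) over \(x\) recovers \(x_{target}\).

```python
import torch

# With quadratic weight eta on the state and linear term c = -eta * x_target,
# the cost 0.5*eta*x^2 + c*x is minimized exactly at x = x_target.
eta, x_target = 5.0, 22.0
x = torch.linspace(10.0, 30.0, 2001)
cost = 0.5 * eta * x**2 + (-eta * x_target) * x
print(f"argmin: {x[cost.argmin()].item():.2f}")  # ~22.00
```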
18.5.4 Differentiability
The solver’s backward pass (in lqr_step.py) computes \(\partial u^* / \partial F\) and \(\partial u^* / \partial B_d\) by differentiating through the KKT conditions of the optimization. You don’t need to understand the internals — just know that if F_hat and Bd_hat have requires_grad=True, then calling loss.backward() on any loss computed from the MPC output will produce gradients for these parameters.
18.6 Code Deep-Dive: Imitation Learning
agent/Imit_EP.py contains the Learner class that implements Phase 1 of the training pipeline. This is the file you will adapt most directly for Assignment 3.
18.6.1 The Learner Class
The key methods are:
| Method | What it does | Assignment 3 adaptation |
|---|---|---|
| `__init__` | Initializes `F_hat`, `Bd_hat` with `requires_grad=True`; sets up Adam optimizer | Change shapes to match SustainDC's state/control/disturbance dimensions |
| `Cost_function(cur_time)` | Builds \(C\), \(c\) matrices from occupancy schedule and setpoint targets | Replace with SustainDC's cost structure (temperature bounds, energy, carbon) |
| `forward(x_init, C, c, cur_time)` | Calls `mpc.MPC(...)` and returns next predicted state + first control action | Replace disturbance lookup (EnergyPlus timestamps \(\to\) SustainDC observations) |
| `predict(x_init, action, cur_time)` | One-step dynamics: \(\hat{x}_{t+1} = F \cdot [x_t; u_t] + B_d \cdot d_t\) | Same timestamp adaptation as `forward` |
| `update_parameters(...)` | Computes \(\mathcal{L}_{imit}\), calls `backward()`, steps optimizer | Reusable as-is (loss structure doesn't depend on environment) |
18.6.2 The Training Loop
The loop in main() follows this structure per epoch:
1. Sample a random timestep from the training set
2. Get expert data: the existing controller's state \(x_t\) and action \(u_t\) at that timestep
3. Run MPC (`learner.forward`): given \(x_t\), produce the learner's predicted action \(\hat{u}_t\)
4. Predict next state (`learner.predict`): given \(x_t\) and the expert's \(u_t\), produce \(\hat{x}_{t+1}\)
5. Every 256 samples: compute loss and backpropagate
The key detail: step 3 uses the MPC to generate \(\hat{u}\), while step 4 uses one-step dynamics to generate \(\hat{x}\). Both pathways produce gradients for F_hat and Bd_hat.
For Assignment 3, you will collect expert demonstrations by running your baseline controller in SustainDC, then adapt this loop to sample from that collected data.
18.6.3 The Imitation Loss
state_loss = mean((x_true - x_pred)^2)
action_loss = mean((u_true - u_pred)^2)
traj_loss = lambda * state_loss + action_loss # lambda = eta = 5

The \(\lambda\) parameter balances two objectives:
- State loss (dynamics accuracy): does the learned model predict the right next state?
- Action loss (behavioral cloning): does the MPC produce the same action as the expert?
Both matter. Good dynamics alone don't guarantee good control — this is the paper's key finding relative to pure system identification.
18.7 Code Deep-Dive: Online PPO
agent/PPO_MPC_EP.py contains the PPO class that implements Phase 2. It reuses the same F_hat/Bd_hat parameters (loaded from imitation learning weights) and continues updating them via policy gradient.
18.7.1 From MPC Solution to Stochastic Policy
The MPC produces a deterministic action \(u^*\). PPO needs a stochastic policy for exploration. The solution is simple — wrap \(u^*\) in a Gaussian:
# In PPO.select_action():
dist = Normal(u_mpc, sigma)  # Gaussian centered on the MPC action; sigma decays over episodes
action = dist.sample()
log_prob = dist.log_prob(action)

\(\sigma\) starts at 1.0 and decays linearly to 0.1 over 90 episodes (sigma = 1 - 0.9 * i_episode / tol_eps), so the agent explores broadly early and exploits later.
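As a self-contained illustration of this exploration schedule (the MPC output value here is made up), the following sketch samples from the Gaussian policy at a few points along the decay:

```python
import torch
from torch.distributions import Normal

tol_eps = 90
u_mpc = torch.tensor(35.0)  # hypothetical MPC output (supply air temperature)
for i_episode in (0, 45, 90):
    sigma = 1 - 0.9 * i_episode / tol_eps   # 1.0 at the start, 0.1 at the end
    dist = Normal(u_mpc, sigma)
    action = dist.sample()
    print(f"episode {i_episode:2d}: sigma={sigma:.2f} "
          f"action={action.item():.2f} log_prob={dist.log_prob(action).item():.2f}")
```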
18.7.2 The PPO Update
The update_parameters method implements the clipped surrogate objective:
# Recompute MPC action under current parameters
opt_states, opt_actions = self.forward(states, f, C, c, current=True)
log_probs, _ = self.evaluate_action(opt_actions[0], actions, sigma)
# PPO clipped surrogate
ratio = exp(log_probs - old_log_probs)
surr1 = ratio * advantage
surr2 = clamp(ratio, 1 - epsilon, 1 + epsilon) * advantage
loss = -min(surr1, surr2).mean()
loss.backward() # gradients flow through MPC to F_hat, Bd_hat
optimizer.step()

The critical point: loss.backward() differentiates through the MPC solver, so F_hat and Bd_hat are updated to maximize expected reward, not just to minimize prediction error.
18.7.3 Environment Interaction Loop
The main() function contains the EnergyPlus-specific interaction loop. For Assignment 3, this is the main part you replace:
| Gnu-RL (EnergyPlus) | Your code (SustainDC) |
|---|---|
| `env = gym.make('5Zone-control_TMY3-v0')` | `env = gym.make('sustaindc/SustainDC-v0', ...)` |
| `timeStep, obs, isTerminal = env.reset()` | `obs, info = env.reset()` |
| `timeStep, obs, isTerminal = env.step([SAT_stpt])` | `obs, reward, terminated, truncated, info = env.step(action)` |
| Disturbances from pre-loaded weather CSV | Disturbances from SustainDC's observation space |
| Reward via `R_func(obs_dict, action, eta)` | Reward from `env.step()` or custom reward function |
| Episode = 1 natural day (96 steps @ 15 min) | Episode defined by SustainDC config |
The PPO class itself (the MPC call, action selection, parameter update) is largely reusable — it’s the environment interaction in main() that needs rewriting.
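A minimal skeleton of what that replacement loop might look like — the environment id and the reset/step signatures are taken from the mapping table above; treat them as assumptions to verify against your SustainDC installation:

```python
import gymnasium as gym

# Hypothetical SustainDC interaction loop; the id string and step/reset signatures
# come from the mapping table above and may differ in your setup.
env = gym.make('sustaindc/SustainDC-v0')
obs, info = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # placeholder: swap in the MPC-based policy here
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
env.close()
```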
18.8 Hands-On: End-to-End Example on ssM_env
This section walks through the complete Gnu-RL pipeline on ssM_env — a lightweight linear system that requires no EnergyPlus. The pattern here is exactly what you will replicate with SustainDC in Assignment 3.
18.8.1 Step 1: Define the Environment and Collect Expert Data
import numpy as np
import torch
import torch.optim as optim
import matplotlib.pyplot as plt
from ssM_env.env import ssModel
from diff_mpc.mpc import MPC, QuadCost, LinDx
# --- Environment: single-zone thermal model ---
A = np.array([[0.95]]) # thermal inertia
Bu = np.array([[0.05]]) # heating effectiveness
Bd = np.array([[0.02, 0.01]]) # weather sensitivity [outdoor_temp, solar]
C_obs = np.array([[1.0]]) # observe temperature directly
# 1 day of disturbance data (96 steps @ 15 min)
n_steps = 96
hours = np.arange(n_steps) / 4
outdoor_temp = 5 + 10 * np.sin(2 * np.pi * (hours - 6) / 24)
solar = np.maximum(0, 3 * np.sin(2 * np.pi * (hours - 6) / 24))
d = np.stack([outdoor_temp, solar]) # [2, 96]
ssMatrix = {'A': A, 'Bu': Bu, 'Bd': Bd, 'C': C_obs, 'd': d}
env = ssModel(ssMatrix, x_init=18, x_target=22, eta=5)
# --- Collect expert data with a proportional controller ---
Kp = 2.0
expert_states, expert_actions, expert_disturbances = [], [], []
obs = env.reset()
for t in range(n_steps - 1):
temp = obs[0]
action = np.array([Kp * max(0, 22 - temp)]) # P-controller toward setpoint
expert_states.append(temp)
expert_actions.append(action[0])
expert_disturbances.append(d[:, t])
obs, reward = env.step(action)
expert_states.append(obs[0]) # final state
print(f"Collected {len(expert_actions)} expert transitions")18.8.2 Step 2: Learn Dynamics from Expert Data
We initialize F_hat and Bd_hat with wrong values, then learn them by minimizing the one-step state prediction error \(\|F[x_t; u_t] + B_d d_t - x_{t+1}\|^2\). This is the state prediction component of Gnu-RL’s imitation loss (the action matching component is added in the full pipeline with more data).
n_state, n_ctrl, n_dist = 1, 1, 2
# Vectorize training data
X = torch.tensor(
np.column_stack([expert_states[:-1], expert_actions]), dtype=torch.float64) # [N, 2]
D = torch.tensor(d[:, :len(expert_actions)].T, dtype=torch.float64) # [N, 2]
Y = torch.tensor(expert_states[1:], dtype=torch.float64) # [N]
# --- Learnable parameters (initialized wrong) ---
F_hat = torch.tensor([[0.5, 0.2]], dtype=torch.float64, requires_grad=True) # true: [0.95, 0.05]
Bd_hat = torch.tensor([[0.1, 0.1]], dtype=torch.float64, requires_grad=True) # true: [0.02, 0.01]
optimizer = optim.Adam([F_hat, Bd_hat], lr=0.05)
losses = []
for epoch in range(500):
optimizer.zero_grad()
pred = (X @ F_hat.T + D @ Bd_hat.T).squeeze() # vectorized one-step prediction
loss = ((pred - Y)**2).mean()
loss.backward()
optimizer.step()
losses.append(loss.item())
if (epoch + 1) % 100 == 0:
print(f"Epoch {epoch+1:3d} loss={losses[-1]:.6f} "
f"F_hat={F_hat.detach().numpy().round(3)} "
f"Bd_hat={Bd_hat.detach().numpy().round(3)}")
print(f"\nTrue F: [0.95, 0.05] Learned: {F_hat.detach().numpy().round(3)}")
print(f"True Bd: [0.02, 0.01] Learned: {Bd_hat.detach().numpy().round(3)}")plt.plot(losses)
plt.xlabel("Epoch")
plt.ylabel("State Prediction Loss")
plt.title("Learning Dynamics Parameters")
plt.show()

You should see F_hat converge toward [0.95, 0.05] and Bd_hat toward [0.02, 0.01].
18.8.3 Step 3: Run the MPC with Learned Dynamics
Now verify that the learned dynamics produce a useful MPC controller — this closes the loop between learning and control:
T = 12
C_mpc = torch.zeros(T, 1, n_state + n_ctrl, n_state + n_ctrl, dtype=torch.float64)
C_mpc[:, :, 0, 0] = 1.0 # comfort weight
C_mpc[:, :, 1, 1] = 0.001 # energy weight
c_mpc = torch.zeros(T, 1, n_state + n_ctrl, dtype=torch.float64)
c_mpc[:, :, 0] = -22.0 # target = 22
env2 = ssModel(ssMatrix, x_init=18, x_target=22, eta=5)
obs = env2.reset()
mpc_temps, mpc_actions = [obs[0]], []
with torch.no_grad():
for t in range(n_steps - 1):
x_init = torch.tensor([[obs[0]]], dtype=torch.float64)
dt = torch.tensor(d[:, t:t+min(T-1, n_steps-1-t)], dtype=torch.float64)
if dt.shape[1] < T - 1:
dt = torch.cat([dt, dt[:, -1:].repeat(1, T-1-dt.shape[1])], dim=1)
ft = torch.mm(Bd_hat.detach(), dt).T.unsqueeze(1)
x_opt, u_opt, _ = MPC(
n_state, n_ctrl, T,
u_lower=0.0, u_upper=50.0,
lqr_iter=20, verbose=-1,
exit_unconverged=False,
)(x_init, QuadCost(C_mpc, c_mpc),
LinDx(F_hat.detach().repeat(T-1, 1, 1, 1), ft))
action = np.array([max(0, u_opt[0, 0, 0].item())])
mpc_actions.append(action[0])
obs, reward = env2.step(action)
mpc_temps.append(obs[0])
hours = np.arange(len(mpc_temps)) / 4
fig, axes = plt.subplots(2, 1, figsize=(12, 5), sharex=True)
axes[0].plot(hours, mpc_temps, label="MPC (learned dynamics)")
axes[0].axhline(y=22, color='k', linestyle='--', label="Setpoint")
axes[0].set_ylabel("Temperature")
axes[0].legend()
axes[1].step(hours[:-1], mpc_actions, where='post')
axes[1].set_ylabel("Control Action")
axes[1].set_xlabel("Hours")
plt.tight_layout()
plt.show()

18.8.4 Step 4: Online Learning
Steps 2–3 used a two-stage approach: learn dynamics offline from expert data, then deploy with fixed parameters. But the deployed controller visits different states than the expert did, so the learned dynamics may be inaccurate in the regions that matter most. Online learning refines \(F\) and \(B_d\) from the controller’s own experience.
We show two approaches that mirror the paper’s two phases:
- 4a — Online dynamics refinement: update \(F\), \(B_d\) to minimize prediction error on transitions the controller actually generates. Simple, but optimizes for accuracy, not control.
- 4b — PPO through the MPC: update \(F\), \(B_d\) via policy gradient to maximize reward. This is Gnu-RL’s key contribution — the dynamics converge to values that produce good control, even if they don’t match the true system exactly.
18.8.4.1 Gradient flow check
Before doing online learning, verify that gradients actually flow through the MPC solver back to the dynamics parameters:
F_test = F_hat.detach().clone().requires_grad_(True)
Bd_test = Bd_hat.detach().clone().requires_grad_(True)
x_init = torch.tensor([[21.0]], dtype=torch.float64)
dt = torch.tensor(d[:, 0:T-1], dtype=torch.float64)
ft = torch.mm(Bd_test, dt).T.unsqueeze(1)
x_opt, u_opt, _ = MPC(
n_state, n_ctrl, T,
lqr_iter=20, verbose=-1,
exit_unconverged=False,
)(x_init, QuadCost(C_mpc, c_mpc), LinDx(F_test.repeat(T-1, 1, 1, 1), ft))
target_loss = (x_opt[1, 0, 0] - 22.0)**2
target_loss.backward()
print(f"dL/dF = {F_test.grad.numpy().round(4)}")
print(f"dL/dBd = {Bd_test.grad.numpy().round(4)}")
print("Non-zero gradients confirm: the MPC solver is differentiable!")18.8.4.2 4a: Online dynamics refinement
The simplest approach: run the MPC controller for one episode, collect the transitions \((x_t, u_t, d_t, x_{t+1})\), then do a batch gradient step on the prediction error \(\|\hat{x}_{t+1} - x_{t+1}\|^2\). Repeat for several episodes — each episode uses improved dynamics, visits better states, and generates better training data.
# Start from Step 2's learned dynamics
F_online = F_hat.detach().clone().requires_grad_(True)
Bd_online = Bd_hat.detach().clone().requires_grad_(True)
opt_online = optim.Adam([F_online, Bd_online], lr=0.01)
n_episodes = 10
online_rewards = []
for ep in range(n_episodes):
env_ep = ssModel(ssMatrix, x_init=18, x_target=22, eta=5)
obs = env_ep.reset()
# --- Rollout: control with current dynamics, collect data ---
ep_states, ep_actions, ep_dist, ep_next = [], [], [], []
ep_reward = []
for t in range(n_steps - 1):
x_t = obs[0]
x_init = torch.tensor([[x_t]], dtype=torch.float64)
dt = torch.tensor(d[:, t:t+min(T-1, n_steps-1-t)], dtype=torch.float64)
if dt.shape[1] < T - 1:
dt = torch.cat([dt, dt[:, -1:].repeat(1, T-1-dt.shape[1])], dim=1)
ft = torch.mm(Bd_online.detach(), dt).T.unsqueeze(1)
with torch.no_grad():
_, u_opt, _ = MPC(n_state, n_ctrl, T,
u_lower=0.0, u_upper=50.0, lqr_iter=20, verbose=-1,
exit_unconverged=False,
)(x_init, QuadCost(C_mpc, c_mpc),
LinDx(F_online.detach().repeat(T-1, 1, 1, 1), ft))
action = max(0, u_opt[0, 0, 0].item())
ep_states.append(x_t)
ep_actions.append(action)
ep_dist.append(d[:, t])
obs, reward = env_ep.step(np.array([action]))
ep_next.append(obs[0])
ep_reward.append(reward)
# --- Batch update on collected episode data ---
X = torch.tensor(np.column_stack([ep_states, ep_actions]), dtype=torch.float64)
D = torch.tensor(np.array(ep_dist), dtype=torch.float64)
Y = torch.tensor(ep_next, dtype=torch.float64)
opt_online.zero_grad()
pred = (X @ F_online.T + D @ Bd_online.T).squeeze()
loss = ((pred - Y)**2).mean()
loss.backward()
opt_online.step()
avg_r = np.mean(ep_reward)
online_rewards.append(avg_r)
print(f"Episode {ep+1}: avg_reward={avg_r:.2f} pred_loss={loss.item():.4f} "
f"F={F_online.detach().numpy().round(4)} "
f"Bd={Bd_online.detach().numpy().round(4)}")You should see both the prediction loss and average reward improve over episodes as the dynamics become more accurate in the regions the controller actually visits.
18.8.4.3 4b: PPO through the Differentiable MPC
The PPO approach is more powerful: instead of minimizing prediction error, it directly maximizes the environment reward by backpropagating policy gradients through the MPC solver. This is the core of Gnu-RL’s Phase 2.
The key steps per episode:
- Rollout: run the MPC, wrap its output \(u^*\) in a Gaussian \(\mathcal{N}(u^*, \sigma)\) for exploration, record \((s, a, r, \log\pi_{old})\)
- Advantages: normalize rewards as a simple baseline
- PPO update: re-run the MPC (with gradients), compute the new log-probability of each recorded action, and apply the clipped surrogate objective
# Start fresh from Step 2's learned dynamics
F_ppo = F_hat.detach().clone().requires_grad_(True)
Bd_ppo = Bd_hat.detach().clone().requires_grad_(True)
opt_ppo = optim.Adam([F_ppo, Bd_ppo], lr=0.003)
n_episodes = 10
sigma_start, sigma_end = 0.5, 0.05 # exploration noise (decays)
eps_clip = 0.2
ppo_rewards = []
for ep in range(n_episodes):
sigma = sigma_start - (sigma_start - sigma_end) * ep / max(n_episodes - 1, 1)
env_ep = ssModel(ssMatrix, x_init=18, x_target=22, eta=5)
obs = env_ep.reset()
# --- Rollout with stochastic policy ---
rollout = [] # list of (x_t, timestep, action, log_prob_old, reward)
for t in range(n_steps - 1):
x_t = obs[0]
x_init = torch.tensor([[x_t]], dtype=torch.float64)
dt = torch.tensor(d[:, t:t+min(T-1, n_steps-1-t)], dtype=torch.float64)
if dt.shape[1] < T - 1:
dt = torch.cat([dt, dt[:, -1:].repeat(1, T-1-dt.shape[1])], dim=1)
ft = torch.mm(Bd_ppo.detach(), dt).T.unsqueeze(1)
with torch.no_grad():
_, u_opt, _ = MPC(n_state, n_ctrl, T,
u_lower=0.0, u_upper=50.0, lqr_iter=20, verbose=-1,
exit_unconverged=False,
)(x_init, QuadCost(C_mpc, c_mpc),
LinDx(F_ppo.detach().repeat(T-1, 1, 1, 1), ft))
mu = u_opt[0, 0, 0].item()
dist = torch.distributions.Normal(mu, sigma)
a_sampled = dist.sample()
action = max(0.0, a_sampled.item())
log_prob_old = dist.log_prob(torch.tensor(action)).item()
obs, reward = env_ep.step(np.array([action]))
rollout.append((x_t, t, action, log_prob_old, reward))
rewards_arr = np.array([r[4] for r in rollout])
advantages = (rewards_arr - rewards_arr.mean()) / (rewards_arr.std() + 1e-8)
# --- PPO update: re-run MPC with gradients ---
ppo_loss = torch.tensor(0.0, dtype=torch.float64)
for idx, (x_t, t, action, lp_old, _) in enumerate(rollout):
x_init = torch.tensor([[x_t]], dtype=torch.float64)
dt = torch.tensor(d[:, t:t+min(T-1, n_steps-1-t)], dtype=torch.float64)
if dt.shape[1] < T - 1:
dt = torch.cat([dt, dt[:, -1:].repeat(1, T-1-dt.shape[1])], dim=1)
ft = torch.mm(Bd_ppo, dt).T.unsqueeze(1) # gradients flow here
_, u_opt, _ = MPC(n_state, n_ctrl, T,
u_lower=0.0, u_upper=50.0, lqr_iter=20, verbose=-1,
exit_unconverged=False,
)(x_init, QuadCost(C_mpc, c_mpc),
LinDx(F_ppo.repeat(T-1, 1, 1, 1), ft)) # and here
mu_new = u_opt[0, 0, 0]
dist_new = torch.distributions.Normal(mu_new, sigma)
log_prob_new = dist_new.log_prob(torch.tensor(action))
ratio = torch.exp(log_prob_new - lp_old)
adv = torch.tensor(advantages[idx], dtype=torch.float64)
surr1 = ratio * adv
surr2 = torch.clamp(ratio, 1 - eps_clip, 1 + eps_clip) * adv
ppo_loss = ppo_loss - torch.min(surr1, surr2)
ppo_loss = ppo_loss / len(rollout)
opt_ppo.zero_grad()
ppo_loss.backward()
opt_ppo.step()
avg_r = rewards_arr.mean()
ppo_rewards.append(avg_r)
print(f"Episode {ep+1:2d}/{n_episodes} sigma={sigma:.2f} "
f"avg_reward={avg_r:.2f} "
f"F={F_ppo.detach().numpy().round(4)} "
f"Bd={Bd_ppo.detach().numpy().round(4)}")fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(range(1, len(online_rewards)+1), online_rewards, 'o-', label="Online dynamics (4a)")
ax.plot(range(1, len(ppo_rewards)+1), ppo_rewards, 's-', label="PPO (4b)")
ax.set_xlabel("Episode")
ax.set_ylabel("Average Reward")
ax.set_title("Online Learning: Prediction-Based vs. Reward-Based")
ax.legend()
plt.tight_layout()
plt.show()

Both approaches improve the controller, but they optimize for different things. The dynamics refinement (4a) makes the model more accurate at predicting states — which indirectly improves control. PPO (4b) makes the model produce better actions — the dynamics may not match the true system exactly, but they converge to values that maximize reward. This is the paper’s central claim: optimizing dynamics for control performance (through the differentiable MPC) outperforms pure system identification.
18.9 Gnu-RL on BOPTEST
The ssM_env example above uses a toy linear system where we know the true dynamics. Before jumping to SustainDC (a data center), let’s apply Gnu-RL to a realistic building via the BOPTEST framework you used in Lecture 9. This is a natural stepping stone: BOPTEST is a building environment with continuous actions — exactly what Gnu-RL was designed for.
18.9.1 The Mapping
For the bestest_hydronic_heat_pump test case:
| Gnu-RL concept | BOPTEST equivalent | Variable |
|---|---|---|
| State \(x\) | Zone temperature | reaTZon_y (in K) |
| Action \(u\) | Heat pump modulation | oveHeaPumY_u (\(\in [0, 1]\)) |
| Disturbance \(d\) | Outdoor temp, solar radiation | TDryBul, HGloHor via /forecast |
| Target | Heating setpoint | reaTSetHea_y (in K) |
| Step size | 15 min (900 s) | Same as Gnu-RL’s original setup |
18.9.2 Step 1: Collect Expert Data from BOPTEST
We run a simple proportional controller for 2 days and record \((x, u, d)\) tuples — the same data collection step as agent/simulate.py in Gnu-RL:
import requests
import numpy as np
import torch
url = 'http://13.217.71.86'
testcase = 'bestest_hydronic_heat_pump'
# Select and initialize
testid = requests.post(f'{url}/testcases/{testcase}/select').json()['testid']
requests.put(f'{url}/initialize/{testid}', json={'start_time': 0, 'warmup_period': 0})
requests.put(f'{url}/step/{testid}', json={'step': 900}) # 15-min steps
# --- Collect expert data with a proportional controller ---
K_p = 2.0
T_set_C = 21.0
T_set_K = 273.15 + T_set_C
n_steps = 2 * 96 # 2 days
states, actions, disturbances = [], [], []
y = requests.post(f'{url}/advance/{testid}', json={}).json()['payload']
for t in range(n_steps):
T_zone = y['reaTZon_y']
T_outdoor = y['weaSta_reaWeaTDryBul_y']
solar = y['weaSta_reaWeaHGloHor_y']
# Proportional controller (expert)
error = T_set_K - T_zone
u_heat = max(0.0, min(K_p * error, 1.0))
states.append(T_zone)
actions.append(u_heat)
disturbances.append([T_outdoor, solar])
y = requests.post(f'{url}/advance/{testid}', json={
'oveHeaPumY_u': u_heat,
'oveHeaPumY_activate': 1
}).json()['payload']
states.append(y['reaTZon_y']) # final state
states = np.array(states)
actions = np.array(actions)
disturbances = np.array(disturbances) # [n_steps, 2]
print(f"Collected {len(actions)} transitions, temp range: "
f"{(states.min()-273.15):.1f}--{(states.max()-273.15):.1f} °C")18.9.3 Step 2: Learn Dynamics from Expert Data
Same approach as the ssM_env example — learn \(F\) and \(B_d\) by minimizing one-step state prediction error on the collected data:
from diff_mpc.mpc import MPC, QuadCost, LinDx
n_state, n_ctrl, n_dist = 1, 1, 2
T = 5 # planning horizon
# Normalize disturbances for numerical stability
d_mean = disturbances.mean(axis=0)
d_std = disturbances.std(axis=0) + 1e-8
d_norm = (disturbances - d_mean) / d_std
# Vectorize training data
X = torch.tensor(np.column_stack([states[:-1], actions]), dtype=torch.float64) # [N, 2]
D = torch.tensor(d_norm, dtype=torch.float64) # [N, 2]
Y = torch.tensor(states[1:], dtype=torch.float64) # [N]
# Learnable parameters (initialized wrong)
F_hat = torch.tensor([[0.5, 0.3]], dtype=torch.float64, requires_grad=True)
Bd_hat = torch.tensor([[0.1, 0.1]], dtype=torch.float64, requires_grad=True)
optimizer = torch.optim.Adam([F_hat, Bd_hat], lr=0.05)
losses = []
for epoch in range(300):
optimizer.zero_grad()
pred = (X @ F_hat.T + D @ Bd_hat.T).squeeze()
loss = ((pred - Y)**2).mean()
loss.backward()
optimizer.step()
losses.append(loss.item())
if (epoch + 1) % 100 == 0:
print(f"Epoch {epoch+1:3d} loss={losses[-1]:.6f} "
f"F={F_hat.detach().numpy().round(4)} "
f"Bd={Bd_hat.detach().numpy().round(4)}")18.9.4 Step 3: Deploy the Learned Controller
After training, use the learned dynamics to run the MPC as a controller on a fresh BOPTEST episode:
import matplotlib.pyplot as plt
# Cost matrices for MPC
C_mpc = torch.zeros(T, 1, n_state + n_ctrl, n_state + n_ctrl, dtype=torch.float64)
C_mpc[:, :, 0, 0] = 1.0 # comfort weight
C_mpc[:, :, 1, 1] = 0.01 # energy weight
c_mpc = torch.zeros(T, 1, n_state + n_ctrl, dtype=torch.float64)
c_mpc[:, :, 0] = -T_set_K # target setpoint
# Fresh test instance
testid2 = requests.post(f'{url}/testcases/{testcase}/select').json()['testid']
requests.put(f'{url}/initialize/{testid2}', json={'start_time': 0, 'warmup_period': 0})
requests.put(f'{url}/step/{testid2}', json={'step': 900})
mpc_temps, mpc_actions = [], []
y = requests.post(f'{url}/advance/{testid2}', json={}).json()['payload']
for t in range(96): # 1 day
T_zone = y['reaTZon_y']
# Get disturbance forecast for the planning horizon
fc = requests.put(f'{url}/forecast/{testid2}', json={
'point_names': ['TDryBul', 'HGloHor'],
'horizon': (T - 1) * 900,
'interval': 900
}).json()['payload']
d_fc = np.column_stack([fc['TDryBul'], fc['HGloHor']]) # forecast returns lists
if len(d_fc) < T - 1:
d_fc = np.pad(d_fc, ((0, T - 1 - len(d_fc)), (0, 0)), 'edge')
d_fc_norm = (d_fc[:T-1] - d_mean) / d_std
# Run MPC with learned dynamics
x_init = torch.tensor([[T_zone]], dtype=torch.float64)
dt = torch.tensor(d_fc_norm.T, dtype=torch.float64)
ft = torch.mm(Bd_hat.detach(), dt).T.unsqueeze(1)
with torch.no_grad():
x_opt, u_opt, _ = MPC(
n_state, n_ctrl, T,
u_lower=0.0, u_upper=1.0,
lqr_iter=20, verbose=-1,
exit_unconverged=False,
)(x_init, QuadCost(C_mpc, c_mpc),
LinDx(F_hat.detach().repeat(T-1, 1, 1, 1), ft))
u_mpc = u_opt[0, 0, 0].item()
mpc_temps.append(T_zone - 273.15)
mpc_actions.append(u_mpc)
y = requests.post(f'{url}/advance/{testid2}', json={
'oveHeaPumY_u': u_mpc,
'oveHeaPumY_activate': 1
}).json()['payload']
mpc_temps.append(y['reaTZon_y'] - 273.15)
kpis = requests.get(f'{url}/kpi/{testid2}').json()['payload']
# Plot
hours = np.arange(len(mpc_temps)) / 4
fig, axes = plt.subplots(2, 1, figsize=(14, 5), sharex=True)
axes[0].plot(hours, mpc_temps, label="Gnu-RL (learned MPC)")
axes[0].axhline(y=T_set_C, color='k', linestyle='--', label="Setpoint")
axes[0].set_ylabel("Zone Temp (°C)")
axes[0].legend()
axes[1].step(hours[:-1], mpc_actions, where='post')
axes[1].set_ylabel("Heat Pump Signal")
axes[1].set_xlabel("Hours")
plt.tight_layout()
plt.show()
print(f"Energy: {kpis['ener_tot']:.2f} kWh, Discomfort: {kpis['tdis_tot']:.4f} Kh")18.9.5 Step 4: Online PPO Learning
Step 3 deployed the MPC with fixed dynamics. Now we close the loop: run a single continuous simulation over several days, and at the end of each day use that day’s trajectory to update \(F\) and \(B_d\) via PPO. This mirrors how Gnu-RL would operate in a real building — the controller runs continuously and refines its model overnight.
Each day has two phases:
- Rollout (96 steps, interacts with BOPTEST): run the MPC, add Gaussian noise for exploration, record \((x_t, d_t, a_t, \log\pi_{old}, r_t)\)
- PPO update (local only, no API calls): re-run the MPC with gradients on the stored inputs, compute the clipped surrogate loss, backpropagate through the MPC to update \(F\) and \(B_d\)
import torch.optim as optim
F_ppo = F_hat.detach().clone().requires_grad_(True)
Bd_ppo = Bd_hat.detach().clone().requires_grad_(True)
opt_ppo = optim.Adam([F_ppo, Bd_ppo], lr=0.001)
n_days = 5
sigma_start, sigma_end = 0.3, 0.05 # exploration noise (decays)
eps_clip = 0.2
def boptest_reward(T_zone_K, u):
"""Reward: penalize deviation from setpoint + energy use."""
return -(5.0 * (T_zone_K - T_set_K)**2 + u**2)
# Single continuous simulation (no reset between days)
tid = requests.post(f'{url}/testcases/{testcase}/select').json()['testid']
requests.put(f'{url}/initialize/{tid}', json={'start_time': 0, 'warmup_period': 0})
requests.put(f'{url}/step/{tid}', json={'step': 900})
y = requests.post(f'{url}/advance/{tid}', json={}).json()['payload']
ppo_rewards = []
for day in range(n_days):
sigma = sigma_start - (sigma_start - sigma_end) * day / max(n_days - 1, 1)
# --- Phase 1: Rollout for one day (continues from where we left off) ---
rollout = []
for t in range(96):
T_zone = y['reaTZon_y']
fc = requests.put(f'{url}/forecast/{tid}', json={
'point_names': ['TDryBul', 'HGloHor'],
'horizon': (T - 1) * 900, 'interval': 900
}).json()['payload']
d_fc = np.column_stack([fc['TDryBul'], fc['HGloHor']])
if len(d_fc) < T - 1:
d_fc = np.pad(d_fc, ((0, T - 1 - len(d_fc)), (0, 0)), 'edge')
d_fc_norm = (d_fc[:T-1] - d_mean) / d_std
# MPC action (no grad during rollout)
x_init = torch.tensor([[T_zone]], dtype=torch.float64)
dt = torch.tensor(d_fc_norm.T, dtype=torch.float64)
ft = torch.mm(Bd_ppo.detach(), dt).T.unsqueeze(1)
with torch.no_grad():
_, u_opt, _ = MPC(n_state, n_ctrl, T,
u_lower=0.0, u_upper=1.0, lqr_iter=20, verbose=-1,
exit_unconverged=False,
)(x_init, QuadCost(C_mpc, c_mpc),
LinDx(F_ppo.detach().repeat(T-1, 1, 1, 1), ft))
# Stochastic policy: Gaussian around MPC output
mu = u_opt[0, 0, 0].item()
dist = torch.distributions.Normal(mu, sigma)
action = float(max(0.0, min(1.0, dist.sample().item())))
log_prob_old = dist.log_prob(torch.tensor(action)).item()
y = requests.post(f'{url}/advance/{tid}', json={
'oveHeaPumY_u': action, 'oveHeaPumY_activate': 1
}).json()['payload']
rollout.append((T_zone, d_fc_norm.copy(), action, log_prob_old,
boptest_reward(T_zone, action)))
rewards_arr = np.array([r[4] for r in rollout])
advantages = (rewards_arr - rewards_arr.mean()) / (rewards_arr.std() + 1e-8)
# --- Phase 2: PPO update (local only, no API calls) ---
ppo_loss = torch.tensor(0.0, dtype=torch.float64)
for idx, (T_zone, d_fc_n, action, lp_old, _) in enumerate(rollout):
x_init = torch.tensor([[T_zone]], dtype=torch.float64)
dt = torch.tensor(d_fc_n.T, dtype=torch.float64)
ft = torch.mm(Bd_ppo, dt).T.unsqueeze(1) # ← gradients flow
_, u_opt, _ = MPC(n_state, n_ctrl, T,
u_lower=0.0, u_upper=1.0, lqr_iter=20, verbose=-1,
exit_unconverged=False,
)(x_init, QuadCost(C_mpc, c_mpc),
LinDx(F_ppo.repeat(T-1, 1, 1, 1), ft)) # ← gradients flow
mu_new = u_opt[0, 0, 0]
dist_new = torch.distributions.Normal(mu_new, sigma)
log_prob_new = dist_new.log_prob(torch.tensor(action))
ratio = torch.exp(log_prob_new - lp_old)
adv = torch.tensor(advantages[idx], dtype=torch.float64)
surr1 = ratio * adv
surr2 = torch.clamp(ratio, 1 - eps_clip, 1 + eps_clip) * adv
ppo_loss = ppo_loss - torch.min(surr1, surr2)
ppo_loss = ppo_loss / len(rollout)
opt_ppo.zero_grad()
ppo_loss.backward()
opt_ppo.step()
ppo_rewards.append(rewards_arr.mean())
print(f"Day {day+1}/{n_days} sigma={sigma:.2f} "
f"avg_reward={rewards_arr.mean():.2f} "
f"F={F_ppo.detach().numpy().round(4)} "
f"Bd={Bd_ppo.detach().numpy().round(4)}")
kpis = requests.get(f'{url}/kpi/{tid}').json()['payload']
print(f"\nCumulative KPIs over {n_days} days: "
f"energy={kpis['ener_tot']:.2f} kWh, discomfort={kpis['tdis_tot']:.4f} Kh")fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(range(1, len(ppo_rewards) + 1), ppo_rewards, 's-')
ax.set_xlabel("Day")
ax.set_ylabel("Average Reward")
ax.set_title("BOPTEST PPO: Average Reward per Day")
ax.axhline(y=ppo_rewards[0], color='k', linestyle='--', alpha=0.4, label="Day 1 baseline")
ax.legend()
plt.tight_layout()
plt.show()

The simulation runs continuously — each day picks up where the previous day ended, so the controller experiences changing weather conditions across days. The PPO update at the end of each day adjusts \(F\) and \(B_d\) to improve control for the conditions the agent actually encounters.
The progression is now: toy system (ssM_env) \(\to\) realistic building (BOPTEST) \(\to\) data center (SustainDC). The code structure is identical at each step — only the environment interface changes.
18.10 Connecting to Assignment 3
The environment interface mapping and the keep-vs-replace breakdown are already covered in sections above. This section addresses the two non-obvious design decisions you will face when adapting Gnu-RL for SustainDC.
18.10.1 State vs. Disturbance Decomposition
SustainDC’s Agent_DC (cooling optimizer) observes 26+ dimensions. You need to decide which are states (\(x\), what the controller affects), which are disturbances (\(d\), exogenous inputs), and which to ignore:
| Role | Candidate variables | Notes |
|---|---|---|
| State \(x\) | Room temperature | The variable you are trying to control — analogous to “Indoor Temp.” in Gnu-RL |
| Disturbance \(d\) | Ambient temperature, IT power/workload, carbon intensity forecast | Things the cooling controller cannot change but needs to anticipate |
| Ignore (initially) | Battery SoC, workload scheduling variables | Belong to the other two agents; not relevant if you fix them to baselines |
This gives you a small state-space model (\(n_{state} = 1\), \(n_{dist} = 3\text{--}5\)) similar in size to Gnu-RL’s original formulation. You can always add more state dimensions later if the model is too simple.
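For example, with one state (room temperature) and, say, four disturbance signals — a hypothetical choice — the learnable parameters from Section 18.5.2 would be initialized as:

```python
import torch

# Hypothetical dimensions for SustainDC's cooling agent: 1 state (room temperature),
# 1 control, 4 disturbances (e.g., ambient temp, IT power, workload, carbon intensity).
n_state, n_ctrl, n_dist = 1, 1, 4

# x_{t+1} = F_hat @ [x_t; u_t] + Bd_hat @ d_t  (same structure as in Imit_EP.py)
F_hat = torch.ones(n_state, n_state + n_ctrl, dtype=torch.float64, requires_grad=True)
Bd_hat = torch.rand(n_state, n_dist, dtype=torch.float64, requires_grad=True)
print(F_hat.shape, Bd_hat.shape)  # torch.Size([1, 2]) torch.Size([1, 4])
```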
18.10.2 Continuous vs. Discrete Actions
Gnu-RL’s MPC produces continuous control actions (e.g., supply air temperature in the range 20–65 °C). SustainDC’s Agent_DC uses a Discrete(3) action space: decrease / maintain / increase CRAH setpoint.
You have two options:
Discretize the MPC output: Run the MPC with continuous bounds, then map the result to the closest discrete action (e.g., \(u^* > \text{threshold} \to\) increase, \(u^* < -\text{threshold} \to\) decrease, else maintain). Simple, but the discretization is not differentiable.
Bypass the discrete wrapper: Access SustainDC’s underlying continuous setpoint directly (the discrete actions just increment/decrement an internal setpoint variable). This preserves differentiability but requires modifying how you interact with the environment.
Either approach is acceptable for the assignment. Document your choice and its trade-offs.
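For the first option, the mapping can be as simple as a thresholding function — a sketch, with the threshold value and the 0/1/2 action ordering as assumptions to check against Agent_DC's definition:

```python
def to_discrete(u_star, threshold=0.5):
    """Map a continuous MPC output (desired setpoint change) to a Discrete(3) action.
    The threshold and the 0=decrease / 1=maintain / 2=increase ordering are
    assumptions; verify them against SustainDC's Agent_DC action space."""
    if u_star > threshold:
        return 2   # increase CRAH setpoint
    if u_star < -threshold:
        return 0   # decrease CRAH setpoint
    return 1       # maintain

print([to_discrete(u) for u in (-1.2, 0.1, 0.9)])  # [0, 1, 2]
```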
18.10.3 Checklist
Before starting Assignment 3, confirm you can: