18 Hands-On: Getting Started with Gnu-RL
18.1 Overview
The code in this notebook requires the Gnu-RL repository and its dependencies. It is not executed during the book build. Follow along by running each cell in your own Python environment.
Important: The original Gnu-RL code requires EnergyPlus (a building energy simulation engine) for its full training pipeline. We will not use EnergyPlus in this tutorial. Instead, we work with:
- The Differentiable MPC module (`diff_mpc/`) — pure PyTorch, no external dependencies
- A simplified state-space environment (`ssM_env/`) — a lightweight linear system simulator
- Pre-computed results from the original authors' experiments
The original codebase was written for PyTorch 1.0 (2018). A pytorch2-compat branch with all necessary patches is provided so you don’t need to modify anything manually.
18.2 Environment Setup
18.2.1 Cloning the Repository
Clone the pytorch2-compat branch, which contains the original code with compatibility patches for modern PyTorch (\(\geq\) 2.0):
git clone -b pytorch2-compat https://github.com/INFERLab/Gnu-RL.git gnu-rl
cd gnu-rl

Why a `pytorch2-compat` branch?
The diff_mpc/ module required four categories of updates to work with PyTorch 2.x:
1. Autograd Function API (`lqr_step.py`): PyTorch 2.x requires `torch.autograd.Function` subclasses to use `@staticmethod` for `forward` and `backward`. The old instance-method style was removed. The patched code splits `LQRStep` into a wrapper class (holds config) and `_LQRStepFunction` (static autograd methods that receive config via `ctx`).
2. Removed linear algebra ops (`pnqp.py`, `lqr_step.py`): `torch.btrifact()` and `torch.btrisolve()` were removed in favor of `torch.linalg.lu_factor()` and `torch.linalg.lu_solve()`.
3. Boolean tensor handling (`util.py`, `pnqp.py`, `lqr_step.py`): `.byte()` masks replaced with proper `torch.bool` tensors; arithmetic negation `1 - mask` replaced with bitwise `~mask`.
4. Variable wrapping (`mpc.py`): `torch.autograd.Variable` has been a no-op wrapper since PyTorch 0.4 and is now fully removed. All `Variable(...)` calls replaced with direct tensor operations.
None of these changes affect the mathematical behavior of the solver — they are purely API migrations.
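For reference, the new-style API looks like the following minimal sketch — a generic toy `Function` illustrating the `@staticmethod` pattern, not the actual `LQRStep` code:

```python
import torch

class _ScaleFunction(torch.autograd.Function):
    """PyTorch 2.x style: forward/backward are @staticmethod, and any per-call
    configuration travels through ctx instead of living on an instance."""

    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale              # stash config for the backward pass
        return x * scale

    @staticmethod
    def backward(ctx, grad_output):
        # one gradient per forward input; non-tensor inputs return None
        return grad_output * ctx.scale, None

x = torch.randn(3, requires_grad=True)
y = _ScaleFunction.apply(x, 2.0)
y.sum().backward()
print(x.grad)  # tensor([2., 2., 2.])
```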
18.2.2 Installing Dependencies
Initialize a uv project and install the required packages:
uv init --python 3.10
uv add torch numpy pandas matplotlib requests

18.2.3 Verifying the Setup
A quick sanity check — import the MPC solver and run a trivial optimization:
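```python
import torch
from diff_mpc.mpc import MPC, QuadCost, LinDx

# 1D system: x_{t+1} = 0.9*x + 0.1*u, drive state toward 24
n_state, n_ctrl, T = 1, 1, 5

F = torch.zeros(T-1, 1, n_state, n_state + n_ctrl)
F[:, :, 0, 0] = 0.9   # state dynamics
F[:, :, 0, 1] = 0.1   # control influence
f = torch.zeros(T-1, 1, n_state)

C = torch.zeros(T, 1, n_state + n_ctrl, n_state + n_ctrl)
C[:, :, 0, 0] = 1.0    # penalize state deviation
C[:, :, 1, 1] = 0.01   # penalize control effort
c = torch.zeros(T, 1, n_state + n_ctrl)
c[:, :, 0] = -24.0     # target state = 24

x_init = torch.tensor([[20.0]])

solver = MPC(n_state, n_ctrl, T,
             u_lower=0.0, u_upper=50.0,
             lqr_iter=20, verbose=0,
             exit_unconverged=False)
x, u, costs = solver(x_init, QuadCost(C, c), LinDx(F, f))

print(f"Initial state: {x_init[0, 0]:.1f}")
print(f"Planned states: {x[:, 0, 0].detach().numpy()}")
print(f"Planned controls: {u[:, 0, 0].detach().numpy()}")
```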
You should see the MPC driving the state from 20 toward 24 by applying large control actions early, then tapering off.
18.3 Repository Walkthrough
The repository has three main parts: the environment-agnostic MPC solver (diff_mpc/), the environment-specific training scripts (agent/), and a lightweight test environment (ssM_env/). For Assignment 3, you will reuse diff_mpc/ as-is and adapt the training logic from agent/ to work with SustainDC instead of EnergyPlus.
18.3.1 Key Files
gnu-rl/
├── diff_mpc/ # Differentiable MPC solver (pure PyTorch) ← KEEP AS-IS
│ ├── mpc.py # MPC class: orchestrates iLQR iterations
│ ├── lqr_step.py # Single LQR backward/forward sweep (autograd Function)
│ ├── pnqp.py # Box-constrained QP solver for control bounds
│ ├── dynamics.py # Dynamics model wrappers (LinDx, etc.)
│ └── util.py # Batch matrix utilities (bmv, bger, bdiag)
│
├── ssM_env/ # Simplified state-space environment (no EnergyPlus)
│ └── env.py # Linear system: x_{t+1} = Ax + Bu*u + Bd*d
│
├── agent/ # Training scripts ← ADAPT FOR SUSTAINDC
│ ├── simulate.py # Step 0: Run baseline controller, collect data
│ ├── Imit_EP.py # Step 1: Offline imitation learning (Learner class)
│ ├── PPO_MPC_EP.py # Step 2: Online PPO refinement
│ ├── utils.py # Reward function, data utilities
│ └── results/ # Pre-computed results from the authors
│
├── eplus_env/ # EnergyPlus Gym wrapper (not used)
└── Demo.ipynb # Results visualization notebook
18.3.2 The Two-Phase Pipeline in Code
The two training phases from the paper map directly to files:
| Phase | File | What it does |
|---|---|---|
| 0. Data collection | `agent/simulate.py` | Runs the existing controller in EnergyPlus, logs \((x, u, d)\) trajectories |
| 1. Imitation learning | `agent/Imit_EP.py` | `Learner` class with learnable \(F\), \(B_d\); trains by backpropagating \(\mathcal{L}_{imit}\) through the MPC solver |
| 2. Online PPO | `agent/PPO_MPC_EP.py` | Wraps MPC output in a Gaussian policy; updates \(F\), \(B_d\) via PPO to maximize reward |
The diff_mpc/ module is called by both phases but is never modified during training — only the dynamics parameters (\(F\), \(B_d\)) that are passed into it are updated.
18.4 Understanding the Pre-Computed Results
The authors provide pre-computed results in agent/results/ so you can visualize the full training pipeline without running EnergyPlus. The code below is adapted from Demo.ipynb. Run these cells to reproduce the four key plots from the paper.
18.4.1 Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
filePath = "agent/results/"
lam = 10 # weight on state loss in the imitation objective

18.4.2 Imitation Learning Loss Curves
This plot shows how \(\mathcal{L}_{state}\), \(\mathcal{L}_{action}\), and the combined \(\mathcal{L}_{imit} = \lambda \mathcal{L}_{state} + \mathcal{L}_{action}\) evolve over 20 epochs. Look for the train/validation gap — a large gap would indicate overfitting of the dynamics parameters.
imit_loss = pd.read_pickle(filePath + "Imit_loss_rl.pkl")
fig, axes = plt.subplots(1, 3, figsize=(16, 4))
for ax, title, train_key, val_key in [
(axes[0], r"$\mathcal{L}_{state}$", "train_state_loss", "val_state_loss"),
(axes[1], r"$\mathcal{L}_{action}$", "train_action_loss", "val_action_loss"),
]:
ax.plot(imit_loss[train_key], label="train")
ax.plot(imit_loss[val_key], label="val")
ax.set_title(title)
ax.set_xlabel("Epoch")
ax.set_xlim((0, 19))
ax.legend()
# Combined loss
axes[2].plot(lam * imit_loss["train_state_loss"] + imit_loss["train_action_loss"], label="train")
axes[2].plot(lam * imit_loss["val_state_loss"] + imit_loss["val_action_loss"], label="val")
axes[2].set_title(r"$\mathcal{L}_{imit}$")
axes[2].set_xlabel("Epoch")
axes[2].set_xlim((0, 19))
axes[2].legend()
plt.tight_layout()
plt.show()

18.4.3 Agent Behavior After Pretraining
After imitation learning, the agent should track the expert controller’s states and actions closely. Each epoch’s results are saved as Imit_rl_{epoch}.pkl — try different epochs to see how tracking improves during training.
epoch = 19 # best epoch (last one); try 0 vs. 19 to see improvement
imit_record = pd.read_pickle(filePath + f"Imit_rl_{epoch}.pkl")
# Setpoint and occupancy live in the simulation dataset, not the imitation record
sim_data = pd.read_pickle(filePath + "Sim-TMY2.pkl").loc[imit_record.index]
# Reset to integer index so set_xlim works with integer bounds
imit_record = imit_record.reset_index(drop=True)
sim_data = sim_data.reset_index(drop=True)
start_time, end_time = 0, 500 # adjust to zoom in/out
fig, axes = plt.subplots(2, 1, figsize=(20, 5), sharex=True)
axes[0].plot(imit_record["Expert nState"], label="Expert")
axes[0].plot(imit_record["Learner nState"], label="Learner")
axes[0].plot(sim_data["Indoor Temp. Setpoint"], "k--", label="Setpoint")
axes[0].set_ylabel("Next State (Temperature)")
axes[0].set_xlim((start_time, end_time))
axes[0].legend()
axes[1].plot(imit_record["Expert action"], label="Expert")
axes[1].plot(imit_record["Learner action"], label="Learner")
axes[1].plot(sim_data["Occupancy Flag"] * 5, "k--", label="Occupancy")
axes[1].set_ylabel("Action (Supply Temp.)")
axes[1].set_xlim((start_time, end_time))
axes[1].legend()
plt.tight_layout()
plt.show()

18.4.4 Online Learning Performance
After PPO refinement, compare Gnu-RL against the EnergyPlus baseline controller. Notice how Gnu-RL learns to pre-heat before occupancy periods — the agent anticipates the setpoint change rather than reacting to it.
baseline = pd.read_pickle(filePath + "Sim-TMY3.pkl").reset_index(drop=True)
rl = pd.read_pickle(filePath + "perf_rl_obs.pkl").reset_index(drop=True)
start_time, end_time = 2000, 2500 # a representative week
fig, axes = plt.subplots(2, 1, figsize=(20, 6), sharex=True)
axes[0].plot(baseline["Indoor Temp."], "b-", label="Baseline")
axes[0].plot(rl["Indoor Temp."], "r-", label="Gnu-RL")
axes[0].plot(rl["Indoor Temp. Setpoint"], "k--", label="Setpoint")
axes[0].set_ylabel("Indoor Temperature")
axes[0].set_xlim([start_time, end_time])
axes[0].legend()
axes[1].plot(baseline["Sys Out Temp."], "b", label="Baseline")
axes[1].plot(rl["Sys Out Temp."], "r", label="Gnu-RL")
axes[1].plot(baseline["Occupancy Flag"] * 30, "k--", label="Occupancy")
axes[1].set_ylabel("Supply Air Temp. (Action)")
axes[1].set_xlim([start_time, end_time])
axes[1].legend()
plt.tight_layout()
plt.show()

18.4.5 Residue Reward Over Time
The residue reward is the difference between Gnu-RL’s reward and the baseline’s reward at each timestep. A positive trend confirms that PPO is improving the policy over the 90-day online learning period.
rl_perf = np.load(filePath + "perf_rl.npy", allow_pickle=True)
baseline_perf = pd.read_pickle(filePath + "Sim-TMY3.pkl")
# Compute residue reward (moving average with confidence band)
residue = pd.Series(rl_perf[:, 0]) # reward column (one value per episode/day)
RewardStats = residue.rolling(window=7).agg(["mean", "std"]).dropna() # 7-day moving average
RewardStats.columns = ["Mean", "Std"]
fig = plt.figure(figsize=(10, 4))
plt.plot(RewardStats["Mean"])
plt.fill_between(
RewardStats.index,
RewardStats["Mean"] - 1.645 * RewardStats["Std"],
RewardStats["Mean"] + 1.645 * RewardStats["Std"],
alpha=0.2,
)
plt.axhline(y=0, color="k", linestyle="--")
plt.ylabel("Residue Reward")
plt.xlabel("Episode (day)")
plt.tight_layout()
plt.show()

18.5 Code Deep-Dive: The Differentiable MPC Policy
You do not need to modify anything inside diff_mpc/. What you do need is a clear understanding of the interface — what you pass in and what you get back — so you can wire it into your own training loop for SustainDC.
18.5.1 The MPC Interface
The solver is called as:
x, u, _ = mpc.MPC(n_state, n_ctrl, T, **options)(x_init, QuadCost(C, c), LinDx(F, f))

The “Shape” column below gives the PyTorch tensor dimensions. These are 3D or 4D tensors because the solver is vectorized over a batch dimension (\(B\)) and a time dimension (\(T\)). Let \(n = n_{state}\), \(m = n_{ctrl}\), \(B = n_{batch}\):
| Argument | Shape | Dimensions | Description |
|---|---|---|---|
| `x_init` | `[n_batch, n_state]` | \(\in \mathbb{R}^{B \times n}\) | Current state (e.g., zone temperatures) |
| `C` | `[T, n_batch, n_state+n_ctrl, n_state+n_ctrl]` | \(\in \mathbb{R}^{T \times B \times (n+m) \times (n+m)}\) | Quadratic cost matrix (diagonal: weights on states and controls) |
| `c` | `[T, n_batch, n_state+n_ctrl]` | \(\in \mathbb{R}^{T \times B \times (n+m)}\) | Linear cost vector (encodes target setpoints) |
| `F` | `[T-1, n_batch, n_state, n_state+n_ctrl]` | \(\in \mathbb{R}^{(T-1) \times B \times n \times (n+m)}\) | Dynamics matrix: \(x_{t+1} = F_t [x_t; u_t] + f_t\) |
| `f` | `[T-1, n_batch, n_state]` | \(\in \mathbb{R}^{(T-1) \times B \times n}\) | Affine dynamics term (disturbance contribution: \(B_d d_t\)) |

| Return | Shape | Dimensions | Description |
|---|---|---|---|
| `x` | `[T, n_batch, n_state]` | \(\in \mathbb{R}^{T \times B \times n}\) | Optimal state trajectory |
| `u` | `[T, n_batch, n_ctrl]` | \(\in \mathbb{R}^{T \times B \times m}\) | Optimal control trajectory (apply `u[0]` only — receding horizon) |
Key solver options: u_lower/u_upper (control bounds), lqr_iter (number of iLQR iterations, typically 20), exit_unconverged=False (don’t error if solver doesn’t fully converge).
18.5.2 How \(F\) and \(f\) Encode the Dynamics
The dynamics \(x_{t+1} = Ax_t + B_u u_t + B_d d_t\) are split into two parts for the solver:
- \(F\): The concatenation \([A \mid B_u]\), shape `[n_state, n_state + n_ctrl]`. This is learnable.
- \(f\): The disturbance term \(B_d d_t\), pre-computed for the planning horizon. \(B_d\) is learnable; \(d_t\) comes from weather/workload forecasts.
In code (from Imit_EP.py):
# F_hat is [n_state, n_state + n_ctrl] --- the learnable dynamics
F_hat = torch.ones((n_state, n_state + n_ctrl)).double().requires_grad_()
# Bd_hat is [n_state, n_dist] --- the learnable disturbance matrix
Bd_hat = torch.tensor(np.random.rand(n_state, n_dist)).requires_grad_()
# f is computed by multiplying Bd_hat with the disturbance forecast
dt = ... # [n_dist, T-1] disturbance values over the horizon
ft = torch.mm(Bd_hat, dt).transpose(0, 1).unsqueeze(1) # [T-1, 1, n_state]

For Assignment 3, you will define what \(x\), \(u\), and \(d\) mean for SustainDC’s cooling environment, then initialize F_hat and Bd_hat with appropriate shapes.
18.5.3 How \(C\) and \(c\) Encode the Cost Function
The MPC minimizes \(\sum_t \frac{1}{2} \tau_t^T C_t \tau_t + c_t^T \tau_t\) where \(\tau_t = [x_t; u_t]\). In Gnu-RL, the cost balances comfort against energy:
# Diagonal of C: [eta, eta, ..., 0.001, 0.001, ...]
# ^-- n_state --^ ^-- n_ctrl --^
# eta depends on occupancy: high when occupied, low when not
diag[:, :n_state] = eta_w_flag # comfort weight (varies by timestep)
diag[:, n_state:] = 0.001 # energy weight (constant)
# Linear cost c encodes the setpoint target
c[:, :n_state] = -eta * x_target # drives state toward setpoint
c[:, n_state:] = 1 # L1 penalty on control effort

For Assignment 3, you will design a cost function appropriate for data center cooling — balancing temperature constraint satisfaction against energy consumption (and potentially carbon intensity).
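To see why the linear term encodes the setpoint, here is a quick numeric check (a stand-alone sketch, independent of the Gnu-RL code): minimizing \(\frac{1}{2}\eta x^2 - \eta\, x_{target}\, x\) over \(x\) recovers \(x_{target}\).

```python
import torch

# With quadratic weight eta on the state and linear term c = -eta * x_target,
# the cost 0.5*eta*x^2 + c*x is minimized exactly at x = x_target.
eta, x_target = 5.0, 22.0
x = torch.linspace(10.0, 30.0, 2001)
cost = 0.5 * eta * x**2 + (-eta * x_target) * x
print(f"argmin: {x[cost.argmin()].item():.2f}")  # ~22.00
```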
18.5.4 Differentiability
The solver’s backward pass (in lqr_step.py) computes \(\partial u^* / \partial F\) and \(\partial u^* / \partial B_d\) by differentiating through the KKT conditions of the optimization. You don’t need to understand the internals — just know that if F_hat and Bd_hat have requires_grad=True, then calling loss.backward() on any loss computed from the MPC output will produce gradients for these parameters.
18.6 Code Deep-Dive: Imitation Learning
agent/Imit_EP.py contains the Learner class that implements Phase 1 of the training pipeline. This is the file you will adapt most directly for Assignment 3.
18.6.1 The Learner Class
The key methods are:
| Method | What it does | Assignment 3 adaptation |
|---|---|---|
| `__init__` | Initializes `F_hat`, `Bd_hat` with `requires_grad=True`; sets up Adam optimizer | Change shapes to match SustainDC's state/control/disturbance dimensions |
| `Cost_function(cur_time)` | Builds \(C\), \(c\) matrices from occupancy schedule and setpoint targets | Replace with SustainDC's cost structure (temperature bounds, energy, carbon) |
| `forward(x_init, C, c, cur_time)` | Calls `mpc.MPC(...)` and returns next predicted state + first control action | Replace disturbance lookup (EnergyPlus timestamps \(\to\) SustainDC observations) |
| `predict(x_init, action, cur_time)` | One-step dynamics: \(\hat{x}_{t+1} = F \cdot [x_t; u_t] + B_d \cdot d_t\) | Same timestamp adaptation as `forward` |
| `update_parameters(...)` | Computes \(\mathcal{L}_{imit}\), calls `backward()`, steps optimizer | Reusable as-is (loss structure doesn't depend on environment) |
18.6.2 The Training Loop
The loop in main() follows this structure per epoch:
1. Sample a random timestep from the training set
2. Get expert data: the existing controller's state \(x_t\) and action \(u_t\) at that timestep
3. Run MPC (`learner.forward`): given \(x_t\), produce the learner's predicted action \(\hat{u}_t\)
4. Predict next state (`learner.predict`): given \(x_t\) and the expert's \(u_t\), produce \(\hat{x}_{t+1}\)
5. Every 256 samples: compute loss and backpropagate
The key detail: step 3 uses the MPC to generate \(\hat{u}\), while step 4 uses one-step dynamics to generate \(\hat{x}\). Both pathways produce gradients for F_hat and Bd_hat.
For Assignment 3, you will collect expert demonstrations by running your baseline controller in SustainDC, then adapt this loop to sample from that collected data.
18.6.3 The Imitation Loss
state_loss = mean((x_true - x_pred)^2)
action_loss = mean((u_true - u_pred)^2)
traj_loss = lambda * state_loss + action_loss # lambda = eta = 5

The \(\lambda\) parameter balances two objectives:
- State loss (dynamics accuracy): does the learned model predict the right next state?
- Action loss (behavioral cloning): does the MPC produce the same action as the expert?
Both matter. Good dynamics alone don't guarantee good control — this is the paper's key finding relative to pure system identification.
18.7 Code Deep-Dive: Online PPO
agent/PPO_MPC_EP.py contains the PPO class that implements Phase 2. It reuses the same F_hat/Bd_hat parameters (loaded from imitation learning weights) and continues updating them via policy gradient.
18.7.1 From MPC Solution to Stochastic Policy
The MPC produces a deterministic action \(u^*\). PPO needs a stochastic policy for exploration. The solution is simple — wrap \(u^*\) in a Gaussian:
# In PPO.select_action():
dist = Normal(u_mpc, sigma)  # Gaussian centered on the MPC action; sigma decays over episodes
action = dist.sample()
log_prob = dist.log_prob(action)

\(\sigma\) starts at 1.0 and decays linearly to 0.1 over 90 episodes (sigma = 1 - 0.9 * i_episode / tol_eps), so the agent explores broadly early and exploits later.
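As a self-contained illustration of this exploration schedule (the MPC output value here is made up), the following sketch samples from the Gaussian policy at a few points along the decay:

```python
import torch
from torch.distributions import Normal

tol_eps = 90
u_mpc = torch.tensor(35.0)  # hypothetical MPC output (supply air temperature)
for i_episode in (0, 45, 90):
    sigma = 1 - 0.9 * i_episode / tol_eps   # 1.0 at the start, 0.1 at the end
    dist = Normal(u_mpc, sigma)
    action = dist.sample()
    print(f"episode {i_episode:2d}: sigma={sigma:.2f} "
          f"action={action.item():.2f} log_prob={dist.log_prob(action).item():.2f}")
```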
18.7.2 The PPO Update
The update_parameters method implements the clipped surrogate objective:
# Recompute MPC action under current parameters
opt_states, opt_actions = self.forward(states, f, C, c, current=True)
log_probs, _ = self.evaluate_action(opt_actions[0], actions, sigma)
# PPO clipped surrogate
ratio = exp(log_probs - old_log_probs)
surr1 = ratio * advantage
surr2 = clamp(ratio, 1 - epsilon, 1 + epsilon) * advantage
loss = -min(surr1, surr2).mean()
loss.backward() # gradients flow through MPC to F_hat, Bd_hat
optimizer.step()

The critical point: loss.backward() differentiates through the MPC solver, so F_hat and Bd_hat are updated to maximize expected reward, not just to minimize prediction error.
18.7.3 Environment Interaction Loop
The main() function contains the EnergyPlus-specific interaction loop. For Assignment 3, this is the main part you replace:
| Gnu-RL (EnergyPlus) | Your code (SustainDC) |
|---|---|
| `env = gym.make('5Zone-control_TMY3-v0')` | `env = gym.make('sustaindc/SustainDC-v0', ...)` |
| `timeStep, obs, isTerminal = env.reset()` | `obs, info = env.reset()` |
| `timeStep, obs, isTerminal = env.step([SAT_stpt])` | `obs, reward, terminated, truncated, info = env.step(action)` |
| Disturbances from pre-loaded weather CSV | Disturbances from SustainDC's observation space |
| Reward via `R_func(obs_dict, action, eta)` | Reward from `env.step()` or custom reward function |
| Episode = 1 natural day (96 steps @ 15 min) | Episode defined by SustainDC config |
The PPO class itself (the MPC call, action selection, parameter update) is largely reusable — it’s the environment interaction in main() that needs rewriting.
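A minimal skeleton of what that replacement loop might look like — the environment id and the reset/step signatures are taken from the mapping table above; treat them as assumptions to verify against your SustainDC installation:

```python
import gymnasium as gym

# Hypothetical SustainDC interaction loop; the id string and step/reset signatures
# come from the mapping table above and may differ in your setup.
env = gym.make('sustaindc/SustainDC-v0')
obs, info = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # placeholder: swap in the MPC-based policy here
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
env.close()
```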
18.8 Hands-On: End-to-End Example on ssM_env
This section walks through the complete Gnu-RL pipeline on ssM_env — a lightweight linear system that requires no EnergyPlus. The pattern here is exactly what you will replicate with SustainDC in Assignment 3.
18.8.1 Step 1: Define the Environment and Collect Expert Data
import numpy as np
import torch
import torch.optim as optim
import matplotlib.pyplot as plt
from ssM_env.env import ssModel
from diff_mpc.mpc import MPC, QuadCost, LinDx
# --- Environment: single-zone thermal model ---
A = np.array([[0.95]]) # thermal inertia
Bu = np.array([[0.05]]) # heating effectiveness
Bd = np.array([[0.02, 0.01]]) # weather sensitivity [outdoor_temp, solar]
C_obs = np.array([[1.0]]) # observe temperature directly
# 1 day of disturbance data (96 steps @ 15 min)
n_steps = 96
hours = np.arange(n_steps) / 4
outdoor_temp = 5 + 10 * np.sin(2 * np.pi * (hours - 6) / 24)
solar = np.maximum(0, 3 * np.sin(2 * np.pi * (hours - 6) / 24))
d = np.stack([outdoor_temp, solar]) # [2, 96]
ssMatrix = {'A': A, 'Bu': Bu, 'Bd': Bd, 'C': C_obs, 'd': d}
env = ssModel(ssMatrix, x_init=18, x_target=22, eta=5)
# --- Collect expert data with a proportional controller ---
Kp = 2.0
expert_states, expert_actions, expert_disturbances = [], [], []
obs = env.reset()
for t in range(n_steps - 1):
temp = obs[0]
action = np.array([Kp * max(0, 22 - temp)]) # P-controller toward setpoint
expert_states.append(temp)
expert_actions.append(action[0])
expert_disturbances.append(d[:, t])
obs, reward = env.step(action)
expert_states.append(obs[0]) # final state
print(f"Collected {len(expert_actions)} expert transitions")18.8.2 Step 2: Learn Dynamics from Expert Data
We initialize F_hat and Bd_hat with wrong values, then learn them by minimizing the one-step state prediction error \(\|F[x_t; u_t] + B_d d_t - x_{t+1}\|^2\). This is the state prediction component of Gnu-RL’s imitation loss (the action matching component is added in the full pipeline with more data).
n_state, n_ctrl, n_dist = 1, 1, 2
# Vectorize training data
X = torch.tensor(
np.column_stack([expert_states[:-1], expert_actions]), dtype=torch.float64) # [N, 2]
D = torch.tensor(d[:, :len(expert_actions)].T, dtype=torch.float64) # [N, 2]
Y = torch.tensor(expert_states[1:], dtype=torch.float64) # [N]
# --- Learnable parameters (initialized wrong) ---
F_hat = torch.tensor([[0.5, 0.2]], dtype=torch.float64, requires_grad=True) # true: [0.95, 0.05]
Bd_hat = torch.tensor([[0.1, 0.1]], dtype=torch.float64, requires_grad=True) # true: [0.02, 0.01]
optimizer = optim.Adam([F_hat, Bd_hat], lr=0.05)
losses = []
for epoch in range(500):
optimizer.zero_grad()
pred = (X @ F_hat.T + D @ Bd_hat.T).squeeze() # vectorized one-step prediction
loss = ((pred - Y)**2).mean()
loss.backward()
optimizer.step()
losses.append(loss.item())
if (epoch + 1) % 100 == 0:
print(f"Epoch {epoch+1:3d} loss={losses[-1]:.6f} "
f"F_hat={F_hat.detach().numpy().round(3)} "
f"Bd_hat={Bd_hat.detach().numpy().round(3)}")
print(f"\nTrue F: [0.95, 0.05] Learned: {F_hat.detach().numpy().round(3)}")
print(f"True Bd: [0.02, 0.01] Learned: {Bd_hat.detach().numpy().round(3)}")plt.plot(losses)
plt.xlabel("Epoch")
plt.ylabel("State Prediction Loss")
plt.title("Learning Dynamics Parameters")
plt.show()

You should see F_hat converge toward [0.95, 0.05] and Bd_hat toward [0.02, 0.01].
18.8.3 Step 3: Run the MPC with Learned Dynamics
Now verify that the learned dynamics produce a useful MPC controller — this closes the loop between learning and control:
T = 12
C_mpc = torch.zeros(T, 1, n_state + n_ctrl, n_state + n_ctrl, dtype=torch.float64)
C_mpc[:, :, 0, 0] = 1.0 # comfort weight
C_mpc[:, :, 1, 1] = 0.001 # energy weight
c_mpc = torch.zeros(T, 1, n_state + n_ctrl, dtype=torch.float64)
c_mpc[:, :, 0] = -22.0 # target = 22
env2 = ssModel(ssMatrix, x_init=18, x_target=22, eta=5)
obs = env2.reset()
mpc_temps, mpc_actions = [obs[0]], []
with torch.no_grad():
for t in range(n_steps - 1):
x_init = torch.tensor([[obs[0]]], dtype=torch.float64)
dt = torch.tensor(d[:, t:t+min(T-1, n_steps-1-t)], dtype=torch.float64)
if dt.shape[1] < T - 1:
dt = torch.cat([dt, dt[:, -1:].repeat(1, T-1-dt.shape[1])], dim=1)
ft = torch.mm(Bd_hat.detach(), dt).T.unsqueeze(1)
x_opt, u_opt, _ = MPC(
n_state, n_ctrl, T,
u_lower=0.0, u_upper=50.0,
lqr_iter=20, verbose=-1,
exit_unconverged=False,
)(x_init, QuadCost(C_mpc, c_mpc),
LinDx(F_hat.detach().repeat(T-1, 1, 1, 1), ft))
action = np.array([max(0, u_opt[0, 0, 0].item())])
mpc_actions.append(action[0])
obs, reward = env2.step(action)
mpc_temps.append(obs[0])
hours = np.arange(len(mpc_temps)) / 4
fig, axes = plt.subplots(2, 1, figsize=(12, 5), sharex=True)
axes[0].plot(hours, mpc_temps, label="MPC (learned dynamics)")
axes[0].axhline(y=22, color='k', linestyle='--', label="Setpoint")
axes[0].set_ylabel("Temperature")
axes[0].legend()
axes[1].step(hours[:-1], mpc_actions, where='post')
axes[1].set_ylabel("Control Action")
axes[1].set_xlabel("Hours")
plt.tight_layout()
plt.show()

18.8.4 Step 4: Online Learning
Steps 2–3 used a two-stage approach: learn dynamics offline from expert data, then deploy with fixed parameters. But the deployed controller visits different states than the expert did, so the learned dynamics may be inaccurate in the regions that matter most. Online learning refines \(F\) and \(B_d\) from the controller’s own experience.
We show two approaches that mirror the paper’s two phases:
- 4a — Online dynamics refinement: update \(F\), \(B_d\) to minimize prediction error on transitions the controller actually generates. Simple, but optimizes for accuracy, not control.
- 4b — PPO through the MPC: update \(F\), \(B_d\) via policy gradient to maximize reward. This is Gnu-RL’s key contribution — the dynamics converge to values that produce good control, even if they don’t match the true system exactly.
18.8.4.1 Gradient flow check
Before doing online learning, verify that gradients actually flow through the MPC solver back to the dynamics parameters:
F_test = F_hat.detach().clone().requires_grad_(True)
Bd_test = Bd_hat.detach().clone().requires_grad_(True)
x_init = torch.tensor([[21.0]], dtype=torch.float64)
dt = torch.tensor(d[:, 0:T-1], dtype=torch.float64)
ft = torch.mm(Bd_test, dt).T.unsqueeze(1)
x_opt, u_opt, _ = MPC(
n_state, n_ctrl, T,
lqr_iter=20, verbose=-1,
exit_unconverged=False,
)(x_init, QuadCost(C_mpc, c_mpc), LinDx(F_test.repeat(T-1, 1, 1, 1), ft))
target_loss = (x_opt[1, 0, 0] - 22.0)**2
target_loss.backward()
print(f"dL/dF = {F_test.grad.numpy().round(4)}")
print(f"dL/dBd = {Bd_test.grad.numpy().round(4)}")
print("Non-zero gradients confirm: the MPC solver is differentiable!")18.8.4.2 4a: Online dynamics refinement
The simplest approach: run the MPC controller for one episode, collect the transitions \((x_t, u_t, d_t, x_{t+1})\), then do a batch gradient step on the prediction error \(\|\hat{x}_{t+1} - x_{t+1}\|^2\). Repeat for several episodes — each episode uses improved dynamics, visits better states, and generates better training data.
# Start from Step 2's learned dynamics
F_online = F_hat.detach().clone().requires_grad_(True)
Bd_online = Bd_hat.detach().clone().requires_grad_(True)
opt_online = optim.Adam([F_online, Bd_online], lr=0.01)
n_episodes = 10
online_rewards = []
for ep in range(n_episodes):
env_ep = ssModel(ssMatrix, x_init=18, x_target=22, eta=5)
obs = env_ep.reset()
# --- Rollout: control with current dynamics, collect data ---
ep_states, ep_actions, ep_dist, ep_next = [], [], [], []
ep_reward = []
for t in range(n_steps - 1):
x_t = obs[0]
x_init = torch.tensor([[x_t]], dtype=torch.float64)
dt = torch.tensor(d[:, t:t+min(T-1, n_steps-1-t)], dtype=torch.float64)
if dt.shape[1] < T - 1:
dt = torch.cat([dt, dt[:, -1:].repeat(1, T-1-dt.shape[1])], dim=1)
ft = torch.mm(Bd_online.detach(), dt).T.unsqueeze(1)
with torch.no_grad():
_, u_opt, _ = MPC(n_state, n_ctrl, T,
u_lower=0.0, u_upper=50.0, lqr_iter=20, verbose=-1,
exit_unconverged=False,
)(x_init, QuadCost(C_mpc, c_mpc),
LinDx(F_online.detach().repeat(T-1, 1, 1, 1), ft))
action = max(0, u_opt[0, 0, 0].item())
ep_states.append(x_t)
ep_actions.append(action)
ep_dist.append(d[:, t])
obs, reward = env_ep.step(np.array([action]))
ep_next.append(obs[0])
ep_reward.append(reward)
# --- Batch update on collected episode data ---
X = torch.tensor(np.column_stack([ep_states, ep_actions]), dtype=torch.float64)
D = torch.tensor(np.array(ep_dist), dtype=torch.float64)
Y = torch.tensor(ep_next, dtype=torch.float64)
opt_online.zero_grad()
pred = (X @ F_online.T + D @ Bd_online.T).squeeze()
loss = ((pred - Y)**2).mean()
loss.backward()
opt_online.step()
avg_r = np.mean(ep_reward)
online_rewards.append(avg_r)
print(f"Episode {ep+1}: avg_reward={avg_r:.2f} pred_loss={loss.item():.4f} "
f"F={F_online.detach().numpy().round(4)} "
f"Bd={Bd_online.detach().numpy().round(4)}")You should see both the prediction loss and average reward improve over episodes as the dynamics become more accurate in the regions the controller actually visits.
18.8.4.3 4b: PPO through the Differentiable MPC
The PPO approach is more powerful: instead of minimizing prediction error, it directly maximizes the environment reward by backpropagating policy gradients through the MPC solver. This is the core of Gnu-RL’s Phase 2.
The key steps per episode:
- Rollout: run the MPC, wrap its output \(u^*\) in a Gaussian \(\mathcal{N}(u^*, \sigma)\) for exploration, record \((s, a, r, \log\pi_{old})\)
- Advantages: normalize rewards as a simple baseline
- PPO update: re-run the MPC (with gradients), compute the new log-probability of each recorded action, and apply the clipped surrogate objective
# Start fresh from Step 2's learned dynamics
F_ppo = F_hat.detach().clone().requires_grad_(True)
Bd_ppo = Bd_hat.detach().clone().requires_grad_(True)
opt_ppo = optim.Adam([F_ppo, Bd_ppo], lr=0.003)
n_episodes = 10
sigma_start, sigma_end = 0.5, 0.05 # exploration noise (decays)
eps_clip = 0.2
ppo_rewards = []
for ep in range(n_episodes):
sigma = sigma_start - (sigma_start - sigma_end) * ep / max(n_episodes - 1, 1)
env_ep = ssModel(ssMatrix, x_init=18, x_target=22, eta=5)
obs = env_ep.reset()
# --- Rollout with stochastic policy ---
rollout = [] # list of (x_t, timestep, action, log_prob_old, reward)
for t in range(n_steps - 1):
x_t = obs[0]
x_init = torch.tensor([[x_t]], dtype=torch.float64)
dt = torch.tensor(d[:, t:t+min(T-1, n_steps-1-t)], dtype=torch.float64)
if dt.shape[1] < T - 1:
dt = torch.cat([dt, dt[:, -1:].repeat(1, T-1-dt.shape[1])], dim=1)
ft = torch.mm(Bd_ppo.detach(), dt).T.unsqueeze(1)
with torch.no_grad():
_, u_opt, _ = MPC(n_state, n_ctrl, T,
u_lower=0.0, u_upper=50.0, lqr_iter=20, verbose=-1,
exit_unconverged=False,
)(x_init, QuadCost(C_mpc, c_mpc),
LinDx(F_ppo.detach().repeat(T-1, 1, 1, 1), ft))
mu = u_opt[0, 0, 0].item()
dist = torch.distributions.Normal(mu, sigma)
a_sampled = dist.sample()
action = max(0.0, a_sampled.item())
log_prob_old = dist.log_prob(torch.tensor(action)).item()
obs, reward = env_ep.step(np.array([action]))
rollout.append((x_t, t, action, log_prob_old, reward))
rewards_arr = np.array([r[4] for r in rollout])
advantages = (rewards_arr - rewards_arr.mean()) / (rewards_arr.std() + 1e-8)
# --- PPO update: re-run MPC with gradients ---
ppo_loss = torch.tensor(0.0, dtype=torch.float64)
for idx, (x_t, t, action, lp_old, _) in enumerate(rollout):
x_init = torch.tensor([[x_t]], dtype=torch.float64)
dt = torch.tensor(d[:, t:t+min(T-1, n_steps-1-t)], dtype=torch.float64)
if dt.shape[1] < T - 1:
dt = torch.cat([dt, dt[:, -1:].repeat(1, T-1-dt.shape[1])], dim=1)
ft = torch.mm(Bd_ppo, dt).T.unsqueeze(1) # gradients flow here
_, u_opt, _ = MPC(n_state, n_ctrl, T,
u_lower=0.0, u_upper=50.0, lqr_iter=20, verbose=-1,
exit_unconverged=False,
)(x_init, QuadCost(C_mpc, c_mpc),
LinDx(F_ppo.repeat(T-1, 1, 1, 1), ft)) # and here
mu_new = u_opt[0, 0, 0]
dist_new = torch.distributions.Normal(mu_new, sigma)
log_prob_new = dist_new.log_prob(torch.tensor(action))
ratio = torch.exp(log_prob_new - lp_old)
adv = torch.tensor(advantages[idx], dtype=torch.float64)
surr1 = ratio * adv
surr2 = torch.clamp(ratio, 1 - eps_clip, 1 + eps_clip) * adv
ppo_loss = ppo_loss - torch.min(surr1, surr2)
ppo_loss = ppo_loss / len(rollout)
opt_ppo.zero_grad()
ppo_loss.backward()
opt_ppo.step()
avg_r = rewards_arr.mean()
ppo_rewards.append(avg_r)
print(f"Episode {ep+1:2d}/{n_episodes} sigma={sigma:.2f} "
f"avg_reward={avg_r:.2f} "
f"F={F_ppo.detach().numpy().round(4)} "
f"Bd={Bd_ppo.detach().numpy().round(4)}")fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(range(1, len(online_rewards)+1), online_rewards, 'o-', label="Online dynamics (4a)")
ax.plot(range(1, len(ppo_rewards)+1), ppo_rewards, 's-', label="PPO (4b)")
ax.set_xlabel("Episode")
ax.set_ylabel("Average Reward")
ax.set_title("Online Learning: Prediction-Based vs. Reward-Based")
ax.legend()
plt.tight_layout()
plt.show()

Both approaches improve the controller, but they optimize for different things. The dynamics refinement (4a) makes the model more accurate at predicting states — which indirectly improves control. PPO (4b) makes the model produce better actions — the dynamics may not match the true system exactly, but they converge to values that maximize reward. This is the paper’s central claim: optimizing dynamics for control performance (through the differentiable MPC) outperforms pure system identification.
18.9 Gnu-RL on BOPTEST
The ssM_env example above uses a toy linear system where we know the true dynamics. Before jumping to SustainDC (a data center), let’s apply Gnu-RL to a realistic building via the BOPTEST framework you used in Lecture 9. This is a natural stepping stone: BOPTEST is a building environment with continuous actions — exactly what Gnu-RL was designed for.
18.9.1 The Mapping
For the bestest_hydronic_heat_pump test case:
| Gnu-RL concept | BOPTEST equivalent | Variable |
|---|---|---|
| State \(x\) | Zone temperature | reaTZon_y (in K) |
| Action \(u\) | Heat pump modulation | oveHeaPumY_u (\(\in [0, 1]\)) |
| Disturbance \(d\) | Outdoor temp, solar radiation | TDryBul, HGloHor via /forecast |
| Target | Heating setpoint | reaTSetHea_y (in K) |
| Step size | 15 min (900 s) | Same as Gnu-RL’s original setup |
18.9.2 Step 1: Collect Expert Data from BOPTEST
We run a simple proportional controller for 2 days and record \((x, u, d)\) tuples — the same data collection step as agent/simulate.py in Gnu-RL:
import requests
import numpy as np
import torch
url = 'http://13.217.71.86'
testcase = 'bestest_hydronic_heat_pump'
# Select and initialize
testid = requests.post(f'{url}/testcases/{testcase}/select').json()['testid']
requests.put(f'{url}/initialize/{testid}', json={'start_time': 0, 'warmup_period': 0})
requests.put(f'{url}/step/{testid}', json={'step': 900}) # 15-min steps
# --- Collect expert data with a proportional controller ---
K_p = 2.0
T_set_C = 21.0
T_set_K = 273.15 + T_set_C
n_steps = 2 * 96 # 2 days
states, actions, disturbances = [], [], []
y = requests.post(f'{url}/advance/{testid}', json={}).json()['payload']
for t in range(n_steps):
T_zone = y['reaTZon_y']
T_outdoor = y['weaSta_reaWeaTDryBul_y']
solar = y['weaSta_reaWeaHGloHor_y']
# Proportional controller (expert)
error = T_set_K - T_zone
u_heat = max(0.0, min(K_p * error, 1.0))
states.append(T_zone)
actions.append(u_heat)
disturbances.append([T_outdoor, solar])
y = requests.post(f'{url}/advance/{testid}', json={
'oveHeaPumY_u': u_heat,
'oveHeaPumY_activate': 1
}).json()['payload']
states.append(y['reaTZon_y']) # final state
states = np.array(states)
actions = np.array(actions)
disturbances = np.array(disturbances) # [n_steps, 2]
print(f"Collected {len(actions)} transitions, temp range: "
f"{(states.min()-273.15):.1f}--{(states.max()-273.15):.1f} °C")18.9.3 Step 2: Learn Dynamics from Expert Data
Same approach as the ssM_env example — learn \(F\) and \(B_d\) by minimizing one-step state prediction error on the collected data:
from diff_mpc.mpc import MPC, QuadCost, LinDx
n_state, n_ctrl, n_dist = 1, 1, 2
T = 5 # planning horizon
# Normalize disturbances for numerical stability
d_mean = disturbances.mean(axis=0)
d_std = disturbances.std(axis=0) + 1e-8
d_norm = (disturbances - d_mean) / d_std
# Vectorize training data
X = torch.tensor(np.column_stack([states[:-1], actions]), dtype=torch.float64) # [N, 2]
D = torch.tensor(d_norm, dtype=torch.float64) # [N, 2]
Y = torch.tensor(states[1:], dtype=torch.float64) # [N]
# Learnable parameters (initialized wrong)
F_hat = torch.tensor([[0.5, 0.3]], dtype=torch.float64, requires_grad=True)
Bd_hat = torch.tensor([[0.1, 0.1]], dtype=torch.float64, requires_grad=True)
optimizer = torch.optim.Adam([F_hat, Bd_hat], lr=0.05)
losses = []
for epoch in range(300):
optimizer.zero_grad()
pred = (X @ F_hat.T + D @ Bd_hat.T).squeeze()
loss = ((pred - Y)**2).mean()
loss.backward()
optimizer.step()
losses.append(loss.item())
if (epoch + 1) % 100 == 0:
print(f"Epoch {epoch+1:3d} loss={losses[-1]:.6f} "
f"F={F_hat.detach().numpy().round(4)} "
f"Bd={Bd_hat.detach().numpy().round(4)}")18.9.4 Step 3: Deploy the Learned Controller
After training, use the learned dynamics to run the MPC as a controller on a fresh BOPTEST episode:
import matplotlib.pyplot as plt
# Cost matrices for MPC
C_mpc = torch.zeros(T, 1, n_state + n_ctrl, n_state + n_ctrl, dtype=torch.float64)
C_mpc[:, :, 0, 0] = 1.0 # comfort weight
C_mpc[:, :, 1, 1] = 0.01 # energy weight
c_mpc = torch.zeros(T, 1, n_state + n_ctrl, dtype=torch.float64)
c_mpc[:, :, 0] = -T_set_K # target setpoint
# Fresh test instance
testid2 = requests.post(f'{url}/testcases/{testcase}/select').json()['testid']
requests.put(f'{url}/initialize/{testid2}', json={'start_time': 0, 'warmup_period': 0})
requests.put(f'{url}/step/{testid2}', json={'step': 900})
mpc_temps, mpc_actions = [], []
y = requests.post(f'{url}/advance/{testid2}', json={}).json()['payload']
for t in range(96): # 1 day
T_zone = y['reaTZon_y']
# Get disturbance forecast for the planning horizon
fc = requests.put(f'{url}/forecast/{testid2}', json={
'point_names': ['TDryBul', 'HGloHor'],
'horizon': (T - 1) * 900,
'interval': 900
}).json()['payload']
d_fc = np.column_stack([fc['TDryBul'], fc['HGloHor']]) # forecast returns lists
if len(d_fc) < T - 1:
d_fc = np.pad(d_fc, ((0, T - 1 - len(d_fc)), (0, 0)), 'edge')
d_fc_norm = (d_fc[:T-1] - d_mean) / d_std
# Run MPC with learned dynamics
x_init = torch.tensor([[T_zone]], dtype=torch.float64)
dt = torch.tensor(d_fc_norm.T, dtype=torch.float64)
ft = torch.mm(Bd_hat.detach(), dt).T.unsqueeze(1)
with torch.no_grad():
x_opt, u_opt, _ = MPC(
n_state, n_ctrl, T,
u_lower=0.0, u_upper=1.0,
lqr_iter=20, verbose=-1,
exit_unconverged=False,
)(x_init, QuadCost(C_mpc, c_mpc),
LinDx(F_hat.detach().repeat(T-1, 1, 1, 1), ft))
u_mpc = u_opt[0, 0, 0].item()
mpc_temps.append(T_zone - 273.15)
mpc_actions.append(u_mpc)
y = requests.post(f'{url}/advance/{testid2}', json={
'oveHeaPumY_u': u_mpc,
'oveHeaPumY_activate': 1
}).json()['payload']
mpc_temps.append(y['reaTZon_y'] - 273.15)
kpis = requests.get(f'{url}/kpi/{testid2}').json()['payload']
# Plot
hours = np.arange(len(mpc_temps)) / 4
fig, axes = plt.subplots(2, 1, figsize=(14, 5), sharex=True)
axes[0].plot(hours, mpc_temps, label="Gnu-RL (learned MPC)")
axes[0].axhline(y=T_set_C, color='k', linestyle='--', label="Setpoint")
axes[0].set_ylabel("Zone Temp (°C)")
axes[0].legend()
axes[1].step(hours[:-1], mpc_actions, where='post')
axes[1].set_ylabel("Heat Pump Signal")
axes[1].set_xlabel("Hours")
plt.tight_layout()
plt.show()
print(f"Energy: {kpis['ener_tot']:.2f} kWh, Discomfort: {kpis['tdis_tot']:.4f} Kh")18.9.5 Step 4: Online PPO Learning
Step 3 deployed the MPC with fixed dynamics. Now we close the loop: run a single continuous simulation over several days, and at the end of each day use that day’s trajectory to update \(F\) and \(B_d\) via PPO. This mirrors how Gnu-RL would operate in a real building — the controller runs continuously and refines its model overnight.
Each day has two phases:
- Rollout (96 steps, interacts with BOPTEST): run the MPC, add Gaussian noise for exploration, record \((x_t, d_t, a_t, \log\pi_{old}, r_t)\)
- PPO update (local only, no API calls): re-run the MPC with gradients on the stored inputs, compute the clipped surrogate loss, backpropagate through the MPC to update \(F\) and \(B_d\)
import torch.optim as optim
F_ppo = F_hat.detach().clone().requires_grad_(True)
Bd_ppo = Bd_hat.detach().clone().requires_grad_(True)
opt_ppo = optim.Adam([F_ppo, Bd_ppo], lr=0.001)
n_days = 5
sigma_start, sigma_end = 0.3, 0.05 # exploration noise (decays)
eps_clip = 0.2
def boptest_reward(T_zone_K, u):
"""Reward: penalize deviation from setpoint + energy use."""
return -(5.0 * (T_zone_K - T_set_K)**2 + u**2)
# Single continuous simulation (no reset between days)
tid = requests.post(f'{url}/testcases/{testcase}/select').json()['testid']
requests.put(f'{url}/initialize/{tid}', json={'start_time': 0, 'warmup_period': 0})
requests.put(f'{url}/step/{tid}', json={'step': 900})
y = requests.post(f'{url}/advance/{tid}', json={}).json()['payload']
ppo_rewards = []
for day in range(n_days):
sigma = sigma_start - (sigma_start - sigma_end) * day / max(n_days - 1, 1)
# --- Phase 1: Rollout for one day (continues from where we left off) ---
rollout = []
for t in range(96):
T_zone = y['reaTZon_y']
fc = requests.put(f'{url}/forecast/{tid}', json={
'point_names': ['TDryBul', 'HGloHor'],
'horizon': (T - 1) * 900, 'interval': 900
}).json()['payload']
d_fc = np.column_stack([fc['TDryBul'], fc['HGloHor']])
if len(d_fc) < T - 1:
d_fc = np.pad(d_fc, ((0, T - 1 - len(d_fc)), (0, 0)), 'edge')
d_fc_norm = (d_fc[:T-1] - d_mean) / d_std
# MPC action (no grad during rollout)
x_init = torch.tensor([[T_zone]], dtype=torch.float64)
dt = torch.tensor(d_fc_norm.T, dtype=torch.float64)
ft = torch.mm(Bd_ppo.detach(), dt).T.unsqueeze(1)
with torch.no_grad():
_, u_opt, _ = MPC(n_state, n_ctrl, T,
u_lower=0.0, u_upper=1.0, lqr_iter=20, verbose=-1,
exit_unconverged=False,
)(x_init, QuadCost(C_mpc, c_mpc),
LinDx(F_ppo.detach().repeat(T-1, 1, 1, 1), ft))
# Stochastic policy: Gaussian around MPC output
mu = u_opt[0, 0, 0].item()
dist = torch.distributions.Normal(mu, sigma)
action = float(max(0.0, min(1.0, dist.sample().item())))
log_prob_old = dist.log_prob(torch.tensor(action)).item()
y = requests.post(f'{url}/advance/{tid}', json={
'oveHeaPumY_u': action, 'oveHeaPumY_activate': 1
}).json()['payload']
rollout.append((T_zone, d_fc_norm.copy(), action, log_prob_old,
boptest_reward(T_zone, action)))
rewards_arr = np.array([r[4] for r in rollout])
advantages = (rewards_arr - rewards_arr.mean()) / (rewards_arr.std() + 1e-8)
# --- Phase 2: PPO update (local only, no API calls) ---
ppo_loss = torch.tensor(0.0, dtype=torch.float64)
for idx, (T_zone, d_fc_n, action, lp_old, _) in enumerate(rollout):
x_init = torch.tensor([[T_zone]], dtype=torch.float64)
dt = torch.tensor(d_fc_n.T, dtype=torch.float64)
ft = torch.mm(Bd_ppo, dt).T.unsqueeze(1) # ← gradients flow
_, u_opt, _ = MPC(n_state, n_ctrl, T,
u_lower=0.0, u_upper=1.0, lqr_iter=20, verbose=-1,
exit_unconverged=False,
)(x_init, QuadCost(C_mpc, c_mpc),
LinDx(F_ppo.repeat(T-1, 1, 1, 1), ft)) # ← gradients flow
mu_new = u_opt[0, 0, 0]
dist_new = torch.distributions.Normal(mu_new, sigma)
log_prob_new = dist_new.log_prob(torch.tensor(action))
ratio = torch.exp(log_prob_new - lp_old)
adv = torch.tensor(advantages[idx], dtype=torch.float64)
surr1 = ratio * adv
surr2 = torch.clamp(ratio, 1 - eps_clip, 1 + eps_clip) * adv
ppo_loss = ppo_loss - torch.min(surr1, surr2)
ppo_loss = ppo_loss / len(rollout)
opt_ppo.zero_grad()
ppo_loss.backward()
opt_ppo.step()
ppo_rewards.append(rewards_arr.mean())
print(f"Day {day+1}/{n_days} sigma={sigma:.2f} "
f"avg_reward={rewards_arr.mean():.2f} "
f"F={F_ppo.detach().numpy().round(4)} "
f"Bd={Bd_ppo.detach().numpy().round(4)}")
kpis = requests.get(f'{url}/kpi/{tid}').json()['payload']
print(f"\nCumulative KPIs over {n_days} days: "
f"energy={kpis['ener_tot']:.2f} kWh, discomfort={kpis['tdis_tot']:.4f} Kh")fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(range(1, len(ppo_rewards) + 1), ppo_rewards, 's-')
ax.set_xlabel("Day")
ax.set_ylabel("Average Reward")
ax.set_title("BOPTEST PPO: Average Reward per Day")
ax.axhline(y=ppo_rewards[0], color='k', linestyle='--', alpha=0.4, label="Day 1 baseline")
ax.legend()
plt.tight_layout()
plt.show()

The simulation runs continuously — each day picks up where the previous day ended, so the controller experiences changing weather conditions across days. The PPO update at the end of each day adjusts \(F\) and \(B_d\) to improve control for the conditions the agent actually encounters.
The progression is now: toy system (ssM_env) \(\to\) realistic building (BOPTEST) \(\to\) data center (SustainDC). The code structure is identical at each step — only the environment interface changes.
18.10 Connecting to Assignment 3
The environment interface mapping and the keep-vs-replace breakdown are already covered in sections above. This section addresses the two non-obvious design decisions you will face when adapting Gnu-RL for SustainDC.
18.10.1 State vs. Disturbance Decomposition
SustainDC’s Agent_DC (cooling optimizer) observes 26+ dimensions. You need to decide which are states (\(x\), what the controller affects), which are disturbances (\(d\), exogenous inputs), and which to ignore:
| Role | Candidate variables | Notes |
|---|---|---|
| State \(x\) | Room temperature | The variable you are trying to control — analogous to “Indoor Temp.” in Gnu-RL |
| Disturbance \(d\) | Ambient temperature, IT power/workload, carbon intensity forecast | Things the cooling controller cannot change but needs to anticipate |
| Ignore (initially) | Battery SoC, workload scheduling variables | Belong to the other two agents; not relevant if you fix them to baselines |
This gives you a small state-space model (\(n_{state} = 1\), \(n_{dist} = 3\text{--}5\)) similar in size to Gnu-RL’s original formulation. You can always add more state dimensions later if the model is too simple.
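For example, with one state (room temperature) and, say, four disturbance signals — a hypothetical choice — the learnable parameters from Section 18.5.2 would be initialized as:

```python
import torch

# Hypothetical dimensions for SustainDC's cooling agent: 1 state (room temperature),
# 1 control, 4 disturbances (e.g., ambient temp, IT power, workload, carbon intensity).
n_state, n_ctrl, n_dist = 1, 1, 4

# x_{t+1} = F_hat @ [x_t; u_t] + Bd_hat @ d_t  (same structure as in Imit_EP.py)
F_hat = torch.ones(n_state, n_state + n_ctrl, dtype=torch.float64, requires_grad=True)
Bd_hat = torch.rand(n_state, n_dist, dtype=torch.float64, requires_grad=True)
print(F_hat.shape, Bd_hat.shape)  # torch.Size([1, 2]) torch.Size([1, 4])
```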
18.10.2 Continuous vs. Discrete Actions
Gnu-RL’s MPC produces continuous control actions (e.g., supply air temperature in the range 20–65 °C). SustainDC’s Agent_DC uses a Discrete(3) action space: decrease / maintain / increase CRAH setpoint.
You have two options:
Discretize the MPC output: Run the MPC with continuous bounds, then map the result to the closest discrete action (e.g., \(u^* > \text{threshold} \to\) increase, \(u^* < -\text{threshold} \to\) decrease, else maintain). Simple, but the discretization is not differentiable.
Bypass the discrete wrapper: Access SustainDC’s underlying continuous setpoint directly (the discrete actions just increment/decrement an internal setpoint variable). This preserves differentiability but requires modifying how you interact with the environment.
Either approach is acceptable for the assignment. Document your choice and its trade-offs.
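For the first option, the mapping can be as simple as a thresholding function — a sketch, with the threshold value and the 0/1/2 action ordering as assumptions to check against Agent_DC's definition:

```python
def to_discrete(u_star, threshold=0.5):
    """Map a continuous MPC output (desired setpoint change) to a Discrete(3) action.
    The threshold and the 0=decrease / 1=maintain / 2=increase ordering are
    assumptions; verify them against SustainDC's Agent_DC action space."""
    if u_star > threshold:
        return 2   # increase CRAH setpoint
    if u_star < -threshold:
        return 0   # decrease CRAH setpoint
    return 1       # maintain

print([to_discrete(u) for u in (-1.2, 0.1, 0.9)])  # [0, 1, 2]
```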
18.10.3 Checklist
Before starting Assignment 3, confirm you can: