17 Gnu-RL: A Precocial Reinforcement Learning Solution for Building HVAC Control Using a Differentiable MPC Policy
17.1 Overview
17.2 Review of the paper
17.2.1 Summary
Gnu-RL (named after the precocial African herbivore) proposes a reinforcement learning agent that is practical to deploy for building HVAC control. The key insight is that replacing the neural network policy with a Differentiable MPC policy yields an agent that is sample-efficient, interpretable, and capable of good performance from the moment of deployment.
Key contributions:
Precocial RL agent: Unlike standard RL agents that require millions of interaction steps (47.5 simulated years in one comparison), Gnu-RL is pre-trained on historical data from existing controllers via imitation learning and performs well immediately upon deployment.
Differentiable MPC policy: The policy solves an MPC optimization problem in its forward pass and backpropagates through the KKT conditions of that optimization in its backward pass. The learnable parameters are the linear dynamics matrices \(\theta = \{A, B_u, B_d\}\)—far fewer parameters than a neural network, with clear physical interpretation.
Two-phase training:
- Phase 1 (Imitation Learning): The agent learns from historical state-action pairs logged by an existing controller, jointly fitting both the dynamics model and the control behavior by minimizing a combined loss: \(\mathcal{L}_{imit}(\theta) = \sum_t \left[\lambda(x_t - \hat{x}_t)^2 + (u_t - \hat{u}_t)^2\right]\), where \(\lambda\) balances state prediction vs. action matching.
- Phase 2 (Online PPO): The pre-trained agent is deployed and continues to improve by maximizing expected reward via policy gradient updates.
Imitation learning outperforms system identification: The paper compares two initialization schemes and shows that imitation learning produces agents that track setpoints well from day one, while system identification (PEM) yields good prediction error but poor control performance—demonstrating that small prediction error does not imply good control.
Simulation results: On an EnergyPlus model of a 600 \(m^2\) office building (Carnegie Mellon’s Intelligent Workplace), Gnu-RL saved 6.6% energy compared to the best published RL agent while maintaining better occupant comfort (lower PPD).
Real-world deployment: Deployed for three weeks in a 20 \(m^2\) conference room on CMU’s campus, controlling a VAV box. Gnu-RL saved 16.7% of cooling demand compared to the existing fixed-schedule controller while achieving significantly better temperature setpoint tracking (RMSE of \(1.02°F\) vs. \(2.4°F\)).
17.2.2 What do we know already?
This paper connects deeply to several concepts covered in the course:
From Lectures 3–5 (Thermal Dynamics of Buildings):
- The linear state-space model \(x_{t+1} = Ax_t + B_u u_t + B_d d_t\) used by Gnu-RL is exactly the type of discrete-time thermal dynamics model we studied. The matrices \(A\), \(B_u\), \(B_d\) capture how zone temperatures evolve based on current state, control actions, and external disturbances.
- The assumption of local linearity around operating points—building dynamics are nonlinear globally but approximately linear in the region of normal operation—is the same simplification used in our thermal RC network models.
From Lectures 8–9 (Control Theory, PID and MPC):
- Gnu-RL’s Differentiable MPC policy directly builds on the MPC formulation covered in class: minimizing a cost function over a receding planning horizon subject to dynamics constraints and input bounds.
- The cost function \(C_t(\tau_t) = \frac{\eta}{2}\sum_i(x_{t,i} - x_{i,setpoint})^2 + \sum_i|u_{t,i}|\) balances comfort (state deviation from setpoint) against energy (control effort), with \(\eta\) weighting their relative importance—the same trade-off we discussed in the MPC lecture.
- The comparison against P-controllers in both experiments connects to the PID control concepts from Lecture 8.
From Lecture 7 (Thermal Comfort):
- The paper uses Predicted Percentage Dissatisfied (PPD) as a comfort metric in the simulation experiment, directly connecting to the thermal comfort models we studied.
- The use of different \(\eta\) values for occupied (\(\eta = 3\)) vs. unoccupied (\(\eta = 0.1\)) periods reflects the occupancy-aware comfort management concepts from Lecture 7.
From Paper 2 (PyDCM/SustainDC):
- Paper 2 already introduced RL fundamentals (MDPs, PPO, policy gradient methods) and the Gymnasium interface. Those concepts apply directly here.
- Paper 2’s “Connecting to Gnu-RL” section previewed the Differentiable MPC policy, the two-phase training, and how to adapt Gnu-RL for data center cooling. This paper provides the full technical details behind that preview.
- The key contrast: Paper 2’s SustainDC uses standard neural network policies (MAPPO, HAPPO, etc.) that require extensive simulation, while Gnu-RL uses a structured MPC policy that needs only historical data from an existing controller.
To fully understand this paper’s contributions, students may need background on:
Differentiable optimization:
- How to differentiate through the solution of an optimization problem (a toy example follows this list)
- Karush-Kuhn-Tucker (KKT) conditions for constrained optimization
- The OptNet framework for embedding optimization as a neural network layer
- Automatic differentiation (backpropagation through the MPC solver)
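The toy example below (not from the paper) shows the core idea in PyTorch: solve an optimization problem in the forward pass and let autograd carry gradients through the solution. For an unconstrained QP, \(\min_u \frac{1}{2}u^\top H u + g(\theta)^\top u\), the minimizer \(u^* = -H^{-1}g(\theta)\) can be computed with a differentiable linear solve; with constraints, OptNet differentiates the KKT system instead.

```python
import torch

# Toy example (not from the paper): differentiate through the solution
# of an unconstrained QP. torch.linalg.solve is itself differentiable,
# so a downstream loss on u* yields gradients w.r.t. theta.
theta = torch.tensor([1.0, -2.0], requires_grad=True)
H = 3.0 * torch.eye(2)                 # fixed, positive-definite Hessian
g = 2.0 * theta                        # problem data depends on theta
u_star = torch.linalg.solve(H, -g)     # forward pass: solve the QP
loss = ((u_star - 1.0) ** 2).sum()     # downstream loss on the solution
loss.backward()                        # backward pass through the solver
print(theta.grad)                      # d loss / d theta
```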
Imitation learning:
- Learning from expert demonstrations (behavioral cloning)
- The distinction between imitation learning and system identification: imitation learning matches both dynamics and actions, while system identification only matches dynamics
- Why good prediction error does not guarantee good control performance
Model Predictive Control details:
- Receding horizon control: solve over a planning horizon \(T\), apply only the first action, re-plan at the next step
- The role of the planning horizon length (12 steps = 3 hours in the simulation)
- How input constraints (\(\underline{u} \leq u \leq \overline{u}\)) are handled in the optimization
- Re-parameterization of the policy output as a Gaussian distribution around the MPC solution for use with policy gradient methods (sketched in the loop after this list)
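A schematic control loop tying the last two points together. Everything here is a placeholder sketch, not the paper’s code: `solve_mpc` stands in for the differentiable MPC solve, and the state update is stubbed out.

```python
import torch

# Schematic receding-horizon loop: plan over a T-step horizon, sample the
# executed action from a Gaussian centered on the first planned action,
# apply it, then re-plan at the next step. All names are placeholders.
T, sigma = 12, 0.1                    # 12 steps = 3 h at 15-min intervals

def solve_mpc(x, d_window):           # stands in for the MPC solve
    return torch.zeros(T, 1)          # a real solver returns the QP optimum

x = torch.tensor([21.0])              # placeholder zone temperature
d_forecast = torch.zeros(96 + T, 3)   # placeholder disturbance forecast
for t in range(96):                   # one simulated day of 15-min steps
    u_plan = solve_mpc(x, d_forecast[t:t + T])
    dist = torch.distributions.Normal(u_plan[0], sigma)
    u = dist.sample().clamp(0.0, 1.0)  # stochastic action, within bounds
    x = x                              # placeholder for building response
```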
System identification vs. imitation learning:
- Prediction Error Methods (PEM) for estimating dynamic system parameters
- Why system identification requires excitation signals that may disturb normal operation
- The paper’s key finding that imitation learning is superior for control initialization
Real-world deployment challenges:
- Communication delays with the Building Automation System (BAS)
- Sensor noise and equipment issues (e.g., the reheat coil leakage discovered during deployment)
- Adapting to discrepancies between expected and actual disturbances (e.g., occupancy sensor counting errors)
17.3 Methods
The paper employs several technical methods that should be expanded upon in class discussion:
17.3.1 Differentiable MPC Policy
The core innovation replaces the neural network policy \(\pi_\theta(u|x)\) with an MPC optimization layer:
- Forward pass: Solves a constrained quadratic program to find the optimal control trajectory \(\tau^*_{1:T} = \{x_t^*, u_t^*\}_{1:T}\) that minimizes the cost function subject to linear dynamics and input constraints
- Backward pass: Computes gradients \(\nabla_\theta \mathcal{L}\) by differentiating through the KKT conditions of the optimization problem, using techniques from OptNet
- Learnable parameters: \(\theta = \{A, B_u, B_d\}\)—the state-space dynamics matrices. The cost function parameters (e.g., \(\eta\)) can also be learned but were fixed in the experiments
- Action selection: Only the first optimal action \(u_t^*\) is applied (receding horizon), and the policy is re-parameterized as \(\hat{u}_t \sim \mathcal{N}(u_t^*, \sigma^2)\) for compatibility with policy gradient methods
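To make the forward/backward structure concrete, here is a minimal sketch under simplifying assumptions the paper does not make: input constraints are dropped and the L1 energy term is replaced by a quadratic penalty \(r\), so the horizon-\(T\) problem has a closed-form solution that autograd can differentiate end to end. The actual Gnu-RL policy uses the differentiable MPC solver of Amos et al., which keeps the constraints and differentiates through the KKT conditions.

```python
import torch

class DifferentiableMPCPolicyLite(torch.nn.Module):
    """Simplified differentiable MPC policy (illustrative, not the
    paper's solver): no input constraints, quadratic control penalty r
    instead of the L1 energy term, so u* has a closed form."""

    def __init__(self, n_state, n_ctrl, n_dist, T, eta=3.0, r=0.01):
        super().__init__()
        # Learnable dynamics x_{t+1} = A x_t + B_u u_t + B_d d_t
        self.A = torch.nn.Parameter(torch.eye(n_state))
        self.B_u = torch.nn.Parameter(0.1 * torch.randn(n_state, n_ctrl))
        self.B_d = torch.nn.Parameter(0.1 * torch.randn(n_state, n_dist))
        self.T, self.eta, self.r = T, eta, r

    def forward(self, x0, d_seq, x_sp):
        # x0: (n_state,); d_seq: (T, n_dist); x_sp: (n_state,) setpoint
        T = self.T
        n_s, n_c = self.B_u.shape
        # Free response: predicted x_1..x_T with u = 0
        free, x = [], x0
        for t in range(T):
            x = self.A @ x + self.B_d @ d_seq[t]
            free.append(x)
        err = (torch.stack(free) - x_sp).reshape(-1)
        # Su maps stacked inputs u_0..u_{T-1} to stacked states:
        # block (t, k) is A^{t-k} B_u for k <= t
        Apow = [torch.eye(n_s)]
        for _ in range(T - 1):
            Apow.append(self.A @ Apow[-1])
        Su = torch.zeros(T * n_s, T * n_c)
        for t in range(T):
            for k in range(t + 1):
                Su[t*n_s:(t+1)*n_s, k*n_c:(k+1)*n_c] = Apow[t - k] @ self.B_u
        # Closed-form minimizer of eta/2 ||x - x_sp||^2 + r/2 ||u||^2;
        # the linear solve is differentiable, so gradients reach theta.
        H = self.eta * Su.T @ Su + self.r * torch.eye(T * n_c)
        g = self.eta * Su.T @ err
        u_star = torch.linalg.solve(H, -g).reshape(T, n_c)
        return u_star[0]  # receding horizon: apply only the first action
```

Because every operation in `forward`, including `torch.linalg.solve`, is differentiable, a loss on the returned action propagates gradients into `A`, `B_u`, and `B_d`; the two training phases below both rely on exactly this property.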
17.3.2 Linear State-Space Dynamics Model
\[x_{t+1} = f_\theta(\tau_t) = Ax_t + B_u u_t + B_d d_t\]
- Assumes local linearity of building thermal dynamics around operating conditions
- Disturbance terms \(d_t\) include weather variables and occupancy, provided over the planning horizon \(d_{t:t+T-1}\)
- The small number of free parameters makes the model interpretable—engineers can inspect \(A\), \(B_u\), \(B_d\) to verify physical plausibility
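Because \(\theta\) has physical meaning, simple sanity checks are possible. A hedged sketch (the sign convention for \(u\) is an assumption, not stated by the paper):

```python
import torch

def check_dynamics(A, B_u):
    """Hypothetical plausibility checks on learned dynamics matrices."""
    # A passive building relaxes toward equilibrium, so the discrete-time
    # A matrix should have spectral radius below 1 (stable dynamics).
    rho = torch.linalg.eigvals(A).abs().max().item()
    print(f"spectral radius of A: {rho:.3f} (expect < 1)")
    # If u is heating power, added heat should raise zone temperature, so
    # the corresponding B_u entries should be positive (assumed convention).
    print("min entry of B_u:", B_u.min().item(), "(expect > 0 for heating)")
```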
17.3.3 Cost Function Design
\[C_t(\tau_t) = \frac{\eta}{2}\sum_{i=1}^{\#states}(x_{t,i} - x_{i,setpoint})^2 + \sum_{i=1}^{\#actions}|u_{t,i}|\]
- \(\eta\) balances comfort (L2-norm of state deviation from setpoint) against energy (L1-norm of control actions)
- Different \(\eta\) values for occupied (\(\eta = 3\)) vs. unoccupied (\(\eta = 0.1\)) periods encode the insight that comfort matters more when people are present
- The cost function is specified by the engineer, not learned—this is where domain knowledge enters beyond the dynamics
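A small sketch of building the time-varying comfort weight. The \(\eta\) values follow the paper; the 8am–6pm weekday occupancy pattern is an illustrative assumption.

```python
def comfort_weight(hour, weekday):
    """eta = 3 when occupied, 0.1 when unoccupied (values from the paper).
    The 8am-6pm weekday schedule is an illustrative assumption; in
    practice the occupancy signal would come from sensors or a BAS
    schedule."""
    occupied = weekday < 5 and 8 <= hour < 18
    return 3.0 if occupied else 0.1
```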
17.3.4 Imitation Learning (Algorithm 1)
- Collects state-action demonstrations \((X, U)\) from an existing controller’s historical data
- Jointly minimizes state prediction loss and action matching loss: \(\mathcal{L}_{imit}(\theta) = \sum_t \left[\lambda(x_t - \hat{x}_t)^2 + (u_t - \hat{u}_t)^2\right]\)
- Hyperparameter \(\lambda\) balances dynamics learning vs. behavioral cloning; can be increased if the existing controller’s actions are of low quality
- Uses RMSprop optimizer with learning rate \(1 \times 10^{-4}\)
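A minimal Phase-1 loop, assuming the `DifferentiableMPCPolicyLite` sketch above and logged arrays `X` (states), `U` (actions), `D` (disturbances), and `X_sp` (setpoints); these names are assumptions, not the paper’s code.

```python
import torch

def imitation_pretrain(policy, X, U, D, X_sp, lam=1.0, num_epochs=10):
    """Phase-1 sketch: behavioral cloning plus one-step dynamics fitting.
    Optimizer and learning rate follow the paper (RMSprop, 1e-4); the
    array names and epoch count are illustrative."""
    opt = torch.optim.RMSprop(policy.parameters(), lr=1e-4)
    for _ in range(num_epochs):
        for t in range(len(U) - policy.T):
            u_hat = policy(X[t], D[t:t + policy.T], X_sp[t])  # MPC action
            # One-step state prediction under the current dynamics model
            x_hat = policy.A @ X[t] + policy.B_u @ U[t] + policy.B_d @ D[t]
            loss = (lam * ((X[t + 1] - x_hat) ** 2).sum()
                    + ((U[t] - u_hat) ** 2).sum())
            opt.zero_grad()
            loss.backward()
            opt.step()
```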
17.3.5 Online Learning with PPO (Algorithm 2)
- After pre-training, deploys the agent and continues training with Proximal Policy Optimization
- PPO loss: \(\mathcal{L}_{PPO}(\theta) = -\hat{\mathbb{E}}_t[\min(w_t(\theta)\hat{A}_t, \text{clip}(w_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t)]\)
- Gradients flow through the MPC solver to update the dynamics parameters \(\theta\)
- The paper demonstrates that maximizing expected reward (PPO objective) produces better controllers than minimizing prediction error (adaptive MPC objective), even though the latter yields smaller state prediction errors
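The clipped objective from the bullet above, in code; `w` and `adv` would come from rollouts of the Gaussian policy \(\mathcal{N}(u_t^*, \sigma^2)\) around the MPC solution.

```python
import torch

def ppo_loss(w, adv, eps=0.2):
    """Clipped PPO objective. w = pi_theta(u|x) / pi_theta_old(u|x) and
    adv is the advantage estimate; eps is the standard clip parameter.
    Because w depends on the MPC solution, its gradient flows through
    the solver into the dynamics parameters."""
    return -torch.min(w * adv,
                      torch.clamp(w, 1 - eps, 1 + eps) * adv).mean()
```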
17.3.6 Comparison: Policy Gradient vs. Prediction Error Minimization
- Adaptive MPC (\(\mathcal{L}_{PEM}\)): Updates parameters online to minimize \(\sum_t (x_t - \hat{x}_t)^2\)—i.e., improve the model’s predictive accuracy
- Gnu-RL (\(\mathcal{L}_{PPO}\)): Updates parameters to maximize expected reward—i.e., improve actual control performance
- Key finding: \(\mathcal{L}_{PPO}\) consistently yields larger residual reward (better control) despite \(\mathcal{L}_{PEM}\) achieving smaller prediction error. This demonstrates that prediction accuracy is only a proxy for control quality, and directly optimizing the task objective is more effective.
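For contrast, a sketch of the adaptive-MPC baseline’s update, which fits the same parameters with a pure prediction-error loss (argument names are illustrative):

```python
import torch

def pem_loss(policy, x_t, u_t, d_t, x_next):
    """Prediction-error objective used by the adaptive-MPC baseline
    (sketch). Only prediction accuracy is optimized; the reward never
    appears, which is why it can improve the model while leaving
    control quality flat."""
    x_hat = policy.A @ x_t + policy.B_u @ u_t + policy.B_d @ d_t
    return ((x_next - x_hat) ** 2).sum()
```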
17.4 Hands-On Tutorial
The concepts covered above — differentiable MPC, imitation learning, and online PPO refinement — are explored in depth through a companion tutorial:
- Hands-On: Getting Started with Gnu-RL: Walks through the Gnu-RL codebase and its Differentiable MPC module, imitation learning and online PPO training pipelines, and interpretation of the original authors’ pre-computed results.