14 PyDCM: Custom Data Center Models with Reinforcement Learning for Sustainability
14.1 Overview
14.2 Review of the paper
14.2.1 Summary
PyDCM (the primary paper) introduces a modular, Python-native data center simulation model that replaces the traditional reliance on EnergyPlus or Modelica for thermal modeling. Key advantages include:
- Speed: PyDCM is 30–40x faster than EnergyPlus for data center simulations thanks to vectorized thermal calculations and in-place operations, and its runtime scales sub-linearly with the number of CPUs.
- Customizability: Users can configure individual server specifications, cabinet arrangements (rows, racks, CPUs per rack), HVAC component parameters, and airflow containment strategies via a simple JSON configuration file (dc_config.json).
- RL integration: PyDCM wraps the simulation as an OpenAI Gymnasium environment, enabling direct use with standard RL libraries without cross-platform communication overhead.
- Reduced resource usage: Lower memory (16.84 GB vs. 18.20 GB) and CPU utilization (18.21% vs. 20.64%) compared to EnergyPlus.
SustainDC (the companion paper, NeurIPS 2024) builds on PyDCM to create a comprehensive multi-agent RL benchmarking platform for data center operations. It defines three interconnected Gymnasium environments:
- Workload Environment (\(Env_{LS}\)): Manages scheduling of delay-tolerant workloads using real traces from Alibaba and Google data centers.
- Data Center Environment (\(Env_{DC}\)): Simulates the thermo-fluid dynamics of the IT room and HVAC cooling system, with automatic chiller “sizing” based on workload and configuration.
- Battery Environment (\(Env_{BAT}\)): Models on-site battery charging/discharging to shift energy consumption away from high grid carbon intensity periods.
SustainDC benchmarks several MARL algorithms (IPPO, MAPPO, HAPPO, HAA2C, HAD3QN, HASAC) across multiple US locations, demonstrating that multi-agent approaches outperform single-agent baselines on carbon footprint, energy consumption, and water usage metrics.
14.2.2 What do we know already?
This paper connects to several concepts covered in earlier lectures:
From Lecture 2 (Why Buildings):
- The energy and carbon motivation: data centers consume massive amounts of energy (a moderately sized DC can use up to 100x the energy of a similarly sized office space), making them prime targets for energy efficiency improvements
- The Sense/Plan/Act framework—PyDCM/SustainDC provides the simulation environment needed to develop and test the “Plan” component (RL-based controllers)
From Lectures 3–5 (Thermal Dynamics of Buildings):
- The heat transfer and thermodynamic modeling concepts directly apply here: data center thermal models capture conduction, convection, and heat exchange between IT equipment, CRAH units, chillers, and cooling towers
- The concept of thermal zones maps to the IT room zones, cabinet-level temperature distributions, and supply/return air temperatures
- CFD-derived approach temperatures are used to simplify the complex 3D thermal dynamics into a computationally tractable model
From Lecture 6 (Thermal Comfort):
- While data centers don’t have human occupants to keep comfortable, the servers have strict thermal operating envelopes. The cooling setpoint optimization problem is analogous to comfort-constrained HVAC control—instead of PMV/PPD bounds, we have maximum allowable server inlet temperatures
From Lecture 7 (AC Power):
- Understanding of electrical power consumption models for IT equipment and HVAC components
- The grid carbon intensity concept—how the carbon content of electricity varies with time, location, and generation mix
From Lectures 8–9 (Control Theory):
- The fundamentals of feedback control apply directly to the CRAH setpoint control problem
- The paper compares RL-based controllers against rule-based baselines (e.g., ASHRAE Guideline 36), connecting to the classical control approaches covered in class
- The multi-agent formulation extends single-loop control to coordinated multi-objective optimization
To fully understand these papers’ contributions, students may need background on:
Data center architecture and cooling systems:
- IT room layout: rows, cabinets, servers, and airflow containment strategies (cold aisle, hot aisle, open)
- HVAC components specific to data centers: Computer Room Air Handlers (CRAH), chillers, pumps, cooling towers
- Heat rejection chain: server → CRAH → chiller → cooling tower → external environment
- Power Usage Effectiveness (PUE) and other data center efficiency metrics
Reinforcement learning fundamentals:
- Markov Decision Processes (MDPs): states, actions, rewards, transitions
- Policy gradient methods, specifically Proximal Policy Optimization (PPO)
- The OpenAI Gymnasium interface (reset, step, init) and how simulation environments are wrapped for RL
- Reward shaping and its impact on learned behavior
Multi-agent reinforcement learning (MARL):
- Independent vs. centralized training paradigms
- Independent PPO (IPPO) vs. Multi-Agent PPO (MAPPO) vs. heterogeneous methods (HAPPO, HAA2C, HAD3QN, HASAC)
- Collaborative reward structures and the \(\alpha\) weighting parameter for reward sharing
- Challenges of heterogeneous action and observation spaces
Grid carbon intensity and carbon-aware computing:
- How grid carbon intensity varies by location, time of day, and season
- Carbon-aware workload scheduling: shifting computation to low-CI periods
- Battery storage for energy arbitrage and carbon footprint reduction
Computational Fluid Dynamics (CFD) for data centers:
- How CFD simulations are used to precompute approach temperatures (difference between CRAH supply temperature and server inlet temperature)
- Simplifying 3D thermal dynamics into reduced-order models suitable for real-time control
14.3 Methods
The papers employ several technical methods that should be expanded upon in class discussion:
14.3.1 1. Data Center Thermal Modeling
- IT power model: Computes power consumption for each CPU and fan based on utilization and inlet temperature, with configurable power curves (idle power, rated full load power, rated full load frequency)
- HVAC model: Hierarchical model of CRAH, chiller, pump, and cooling tower, each with configurable parameters. Energy consumption is computed based on thermal load and component characteristics
- CFD-derived approach temperatures: Precomputed temperature offsets between CRAH supply air and actual server inlet temperatures, capturing the 3D airflow patterns without running CFD at every timestep
- Automatic chiller sizing: HVAC cooling capacities are automatically adjusted based on workload demands and IT room configurations
14.3.2 2. Reinforcement Learning Formulation
State spaces: Each agent observes different variables—e.g., the cooling agent sees time of day, dry-bulb temperature, room temperature, previous energy usage, and forecasted grid CI
Action spaces: All three agents use Discrete(3). The specific mappings are:

| Agent | Action 0 | Action 1 | Action 2 | Do-Nothing |
|---|---|---|---|---|
| \(Agent_{LS}\) | Defer shiftable tasks | Process normally | Process deferred queue | 1 |
| \(Agent_{DC}\) | Decrease CRAH setpoint | Maintain setpoint | Increase setpoint | 1 |
| \(Agent_{BAT}\) | Charge battery | Discharge battery | Idle (no charge/discharge) | 2 |

Note that the do-nothing action is 1 for the workload and cooling agents but 2 for the battery agent—easy to get wrong when writing baselines.
Reward design: Default reward is negative carbon footprint (\(CFP_t = (E_{hvac} + E_{it} + E_{bat}) \times CI_t\)), with customizable alternatives including energy consumption, operating costs, and water usage
Collaborative reward sharing: Weighted combination where each agent receives \(\alpha\) of its own reward plus \((1-\alpha)/2\) from each of the other two agents
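With three agents, the sharing rule works out as in the short sketch below (illustrative variable names and an assumed \(\alpha = 0.8\); this is not SustainDC's internal implementation):

```python
# Collaborative reward sharing: each agent keeps a fraction alpha of its own
# reward and receives (1 - alpha) / 2 of each of the other two agents' rewards.
# Variable names and alpha = 0.8 are illustrative assumptions.
def share_rewards(r_ls, r_dc, r_bat, alpha=0.8):
    shared_ls  = alpha * r_ls  + (1 - alpha) / 2 * (r_dc + r_bat)
    shared_dc  = alpha * r_dc  + (1 - alpha) / 2 * (r_ls + r_bat)
    shared_bat = alpha * r_bat + (1 - alpha) / 2 * (r_ls + r_dc)
    return shared_ls, shared_dc, shared_bat
```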
14.3.3 3. Multi-Agent Coordination
- Heterogeneous MARL: Different agents have different observation spaces, action spaces, and reward structures, requiring heterogeneous multi-agent methods
- Sequential environment stepping: At each timestep, the workload agent acts first (adjusting the compute load), then the DC cooling agent (setting the CRAH setpoint given the adjusted workload), then the battery agent (deciding charge/discharge given total energy consumption)
- Benchmarked algorithms: IPPO, MAPPO (centralized critic), HAPPO, HAA2C, HAD3QN, HASAC—spanning on-policy and off-policy, homogeneous and heterogeneous approaches
14.3.4 4. Simulation Performance Optimization
- Vectorized computations: NumPy-based vectorized thermal and power calculations instead of EnergyPlus’s sequential approach
- In-place operations: Minimizing memory allocation overhead during simulation steps
- Efficient reset: Fast environment reset for RL training loops (99.99% reduction in reset time vs. EnergyPlus)
- Sub-linear scaling: Simulation time scales sub-linearly with the number of CPUs, enabling hyper-scale DC modeling
14.4 Data Center Cooling Primer
Before diving into the codebase, it helps to understand the physical system being modeled.
14.4.1 The Heat Rejection Chain
A data center’s cooling system removes heat generated by IT equipment through a series of stages:
CPU/GPU → Server Fans → Rack (hot aisle) → CRAH Unit → Chilled Water Loop → Chiller → Cooling Tower → Atmosphere
Each stage has an associated energy cost:
- IT equipment generates heat proportional to its power consumption. A server drawing 300W converts nearly all of that to heat.
- Server fans push air through the rack, moving heat from the chip to the hot aisle. Fan power increases with temperature (more cooling needed → faster fans).
- Computer Room Air Handlers (CRAHs) draw hot air from the hot aisle, cool it via a chilled water coil, and supply cold air to the cold aisle. The CRAH fan consumes significant power.
- Chillers cool the water circulating through the CRAHs. Chiller efficiency is measured by the Coefficient of Performance (COP)—the ratio of cooling provided to electricity consumed. A COP of 5.0 means 5 kW of cooling per 1 kW of electricity.
- Cooling towers reject heat from the chiller condenser loop to the atmosphere via evaporative cooling, consuming water and pump energy.
14.4.2 Key Metrics
Power Usage Effectiveness (PUE) is the standard efficiency metric for data centers:
\[PUE = \frac{E_{total}}{E_{IT}} = \frac{E_{IT} + E_{cooling} + E_{other}}{E_{IT}}\]
A PUE of 1.0 is the theoretical ideal (all energy goes to computation). Typical values range from 1.2 (efficient) to 2.0+ (inefficient). The cooling system is the largest contributor to the gap between PUE and 1.0.
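For a concrete sense of the numbers (illustrative values, not taken from the papers): a facility drawing 1,000 kW of IT load, 350 kW of cooling, and 50 kW of other overhead has

\[PUE = \frac{1000 + 350 + 50}{1000} = 1.4\]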
ASHRAE Thermal Guidelines define allowable server inlet temperature ranges. The recommended range is 18–27°C, with an allowable range extending to 15–32°C for short periods. Operating at higher setpoints saves cooling energy but risks thermal throttling or hardware damage.
14.4.3 Approach Temperatures and CFD
In a real data center, the temperature at each server’s inlet differs from the CRAH supply temperature due to airflow mixing, recirculation, and containment effectiveness. The approach temperature captures this offset:
\[T_{inlet,rack} = T_{CRAH,supply} + \Delta T_{approach,rack}\]
PyDCM uses precomputed CFD results to set these approach temperatures per rack, avoiding the need to run expensive 3D fluid simulations at every timestep while still capturing spatial non-uniformity.
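A minimal sketch of how such a lookup might be applied each timestep (the array values and variable names are illustrative, not PyDCM's actual data or identifiers):

```python
import numpy as np

# CFD-derived approach temperatures, one per rack [°C] (illustrative values).
approach_temp = np.array([2.1, 3.4, 4.0, 2.8])

crah_supply_temp = 18.0  # current CRAH supply-air temperature [°C]

# Per-rack server inlet temperature: supply temperature plus the rack's offset.
rack_inlet_temp = crah_supply_temp + approach_temp
```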
14.5 Reinforcement Learning Foundations
14.5.1 Markov Decision Processes (MDPs)
An RL problem is formalized as an MDP defined by the tuple \((\mathcal{S}, \mathcal{A}, P, R, \gamma)\):
- \(\mathcal{S}\): State space — the set of all possible observations the agent can see
- \(\mathcal{A}\): Action space — the set of all actions the agent can take
- \(P(s'|s,a)\): Transition function — the probability of reaching state \(s'\) after taking action \(a\) in state \(s\)
- \(R(s,a)\): Reward function — the immediate reward for taking action \(a\) in state \(s\)
- \(\gamma \in [0,1]\): Discount factor — how much the agent values future vs. immediate rewards
The agent’s goal is to learn a policy \(\pi(a|s)\) that maximizes the expected cumulative discounted reward:
\[J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right]\]
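As a toy numerical example of the discounted return (values invented for illustration):

```python
# Discounted return over a short reward sequence with gamma = 0.99.
gamma = 0.99
rewards = [-1.0, -0.5, -0.8]  # e.g., negative carbon footprint per step
G = sum(gamma**t * r for t, r in enumerate(rewards))
# G = -1.0 + 0.99 * (-0.5) + 0.99**2 * (-0.8) ≈ -2.28
```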
14.5.2 Mapping MDPs to Data Center Cooling
In the SustainDC cooling environment (dc_gym.py), the MDP components are:
| MDP Component | Data Center Cooling |
|---|---|
| State \(s_t\) | Ambient temperature, CRAH setpoint, zone air temperature, HVAC power, IT power (5-dim vector, normalized) |
| Action \(a_t\) | Discrete: {decrease setpoint, maintain setpoint, increase setpoint} |
| Reward \(r_t\) | \(-CFP_t = -(E_{hvac} + E_{it}) \times CI_t\) (negative carbon footprint) |
| Transition | PyDCM thermal model steps forward one timestep |
| Episode | One year of operation (8,760 hourly steps) |
14.5.3 The OpenAI Gymnasium Interface
The individual sub-environments (dc_gym.py, ls_gym.py, bat_gym.py) follow the standard Gymnasium API:
```python
import gymnasium as gym

# Create the environment
env = gym.make("dc_gym-v0", config=config)

# Reset to initial state
obs, info = env.reset()

# Run one episode
done = False
while not done:
    action = agent.select_action(obs)  # your policy
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
```

The key methods:
- reset() → returns the initial observation and info dict. In PyDCM this is ~100x faster than EnergyPlus because there are no IDF files to recompile.
- step(action) → advances the simulation by one timestep and returns (observation, reward, terminated, truncated, info). PyDCM's vectorized computation makes this ~30x faster than EnergyPlus.
- observation_space and action_space → define the valid ranges for observations and actions.
The SustainDC multi-agent wrapper (sustaindc_env.py) was designed for the HARL training framework and deviates from the standard Gymnasium API in several ways:
- reset() returns only the observation dict—not the (obs, info) tuple you would expect. The info dict is stored internally on self.infos.
- observation_space and action_space are lists (one entry per active agent), not dicts keyed by agent name. The ordering follows the "agents" config list.
- step() returns the standard 5-tuple (obs, rewards, terminated, truncated, info), but all values are dicts keyed by agent name (e.g., "agent_dc").
- Episode end uses truncation, not termination. The wrapper sets truncateds["__all__"] = True but leaves terminateds["__all__"] = False. Your simulation loop must check both: done = terminated.get("__all__", False) or truncated.get("__all__", False). If you only check terminated, the loop will never exit and will eventually crash when data managers run past the end of their arrays. See the sketch after this list.
- The month parameter is required. It is not in EnvConfig's defaults, but if omitted, self.month is None and the environment crashes. The value is 0-indexed: 0 = January, 11 = December.
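Putting these deviations together, a minimal simulation loop looks roughly like the sketch below. It assumes the environment has already been constructed and that the three agent keys are "agent_ls", "agent_dc", and "agent_bat" (only "agent_dc" is confirmed above; check your "agents" config list):

```python
# Minimal SustainDC multi-agent loop (sketch). Agent keys other than
# "agent_dc" are assumptions; verify them against the "agents" config list.
obs = env.reset()  # the wrapper returns only the obs dict, no info tuple

done = False
while not done:
    # Hypothetical do-nothing baseline: action 1 for the workload and cooling
    # agents, action 2 for the battery agent (see the action table above).
    actions = {"agent_ls": 1, "agent_dc": 1, "agent_bat": 2}

    obs, rewards, terminated, truncated, info = env.step(actions)

    # Episode end is signaled via truncation, so check BOTH dicts.
    done = terminated.get("__all__", False) or truncated.get("__all__", False)
```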
See the SustainDC hands-on tutorial for working code examples.
14.5.4 Policy Gradient Methods and PPO
Policy gradient methods directly optimize the policy \(\pi_\theta(a|s)\) (parameterized by \(\theta\)) by computing gradients of the expected reward with respect to \(\theta\):
\[\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t|s_t) \cdot A_t\right]\]
where \(A_t\) is the advantage function — how much better action \(a_t\) is compared to the average action in state \(s_t\).
Proximal Policy Optimization (PPO) is the most widely used policy gradient algorithm. Its key idea is to prevent destructively large policy updates by clipping the objective:
\[L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta) A_t, \; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t\right)\right]\]
where \(r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\) is the probability ratio between the new and old policies, and \(\epsilon\) (typically 0.2) limits how far the new policy can deviate from the old one in a single update.
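Numerically, the clipped objective can be sketched as below (NumPy, batch form; real implementations such as those in the HARL library use PyTorch tensors and autograd, so treat this as a numerical illustration only):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Numerical sketch of PPO's clipped surrogate objective."""
    ratio = np.exp(logp_new - logp_old)                      # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    # PPO maximizes the mean of the element-wise minimum of the two terms.
    return np.mean(np.minimum(unclipped, clipped))
```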
PPO is the base algorithm for most of SustainDC’s multi-agent methods (IPPO, MAPPO, HAPPO all build on PPO).
14.5.5 Reward Shaping
The choice of reward function fundamentally determines what the agent learns. SustainDC provides several options via utils/reward_creator.py:
| Reward Function | What It Optimizes |
|---|---|
| default_dc_reward | Negative carbon footprint: \(-(E_{hvac} + E_{it}) \times CI_t\) |
| tou_reward | Time-of-use electricity cost |
| energy_PUE_reward | Power Usage Effectiveness |
| water_usage_efficiency_reward | Cooling tower water consumption |
| temperature_efficiency_reward | Thermal constraint satisfaction |
| custom_agent_reward | User-defined (template provided) |
Using carbon footprint as the reward means the agent will learn to cool less aggressively during high-CI periods (when the grid is dirty) and more aggressively during low-CI periods. This is desirable for sustainability but may conflict with thermal safety. In practice, you often need to combine multiple objectives or add constraint penalties.
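As an illustration of combining objectives, a custom reward could blend the carbon-footprint term with a penalty on thermal-limit violations. The sketch below only conveys the idea; the argument names are assumptions, and the actual signature expected by the custom_agent_reward template in utils/reward_creator.py may differ:

```python
def combined_dc_reward(e_hvac, e_it, ci, inlet_temp, t_max=27.0, penalty=10.0):
    """Illustrative combined reward: carbon footprint plus a thermal penalty.

    Argument names are assumptions for this sketch, not the actual
    custom_agent_reward signature in utils/reward_creator.py.
    """
    carbon = -(e_hvac + e_it) * ci              # negative carbon footprint
    overheat = max(0.0, inlet_temp - t_max)     # degrees C above the limit
    return carbon - penalty * overheat
```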
14.6 Hands-On Tutorials
The concepts covered above — data center thermal modeling, the Gymnasium interface, PPO, and multi-agent coordination — are explored in depth through two companion tutorials:
- Getting Started with PyDCM: Uses the pydcm branch (BuildSys 2023). Covers environment setup, the Gymnasium simulation loop, benchmarking PyDCM's speed advantage, and training a single-agent PPO controller for the cooling task.
- Getting Started with SustainDC: Uses the main branch (NeurIPS 2024). Covers the full multi-agent framework with three coordinated agents, rule-based baseline controllers, HARL training (HAPPO, MAPPO, etc.), and evaluation using the five SustainDC metrics.
Start with the PyDCM tutorial (simpler, single-agent) before moving to SustainDC (multi-agent with more configuration options).
14.7 Connecting to Gnu-RL (Paper 3 Preview)
Assignment 3 asks you to apply the Gnu-RL algorithm (Paper 3) to the SustainDC environment. Gnu-RL uses a fundamentally different policy architecture from the standard neural network policies in MAPPO/HAPPO.
14.7.1 The Differentiable MPC Policy
Instead of a neural network mapping observations to actions, Gnu-RL uses a Differentiable Model Predictive Control (MPC) layer as the policy. At each timestep, the policy solves an optimization problem:
\[\min_{u_t, \ldots, u_{t+T-1}} \sum_{k=t}^{t+T-1} \left[ \frac{\eta}{2} \|x_k - x_{setpoint}\|^2 + \|u_k\| \right]\]
subject to:
\[x_{k+1} = A x_k + B_u u_k + B_d d_k, \quad \underline{u} \leq u_k \leq \overline{u}\]
where:
- \(x_k\): state (zone temperatures)
- \(u_k\): control action (cooling setpoint adjustment)
- \(d_k\): disturbances (weather, workload)
- \(T\): planning horizon (e.g., 3 hours ahead)
- \(\eta\): weight balancing comfort vs. energy
- \(A, B_u, B_d\): learnable system dynamics parameters
The key insight is that this optimization problem is differentiable with respect to \(A\), \(B_u\), \(B_d\), and \(\eta\). This means we can backpropagate through the MPC solver to learn the dynamics model end-to-end.
14.7.2 Why This Matters
Compared to standard neural network policies (PPO, SAC, etc.), the Differentiable MPC policy differs in several ways:
| Property | Neural Network Policy | Differentiable MPC Policy |
|---|---|---|
| Parameters | Thousands–millions of weights | Only \(A\), \(B_u\), \(B_d\) (a few dozen) |
| Sample efficiency | Needs many episodes | Learns from limited data |
| Interpretability | Black box | Learned dynamics are inspectable |
| Domain knowledge | None encoded | Planning horizon, constraints, cost structure |
| Pre-training | Random initialization | Imitation learning from baseline controller |
14.7.3 Adapting for Data Center Cooling
To apply Gnu-RL to the SustainDC DC cooling environment, you need to define:
State vector \(x_t\) (what temperatures to track):
- Zone air temperature (IT room average)
- CRAH supply air temperature
Control action \(u_t\):
- CRAH setpoint adjustment (maps to the discrete actions in dc_gym.py, or can be relaxed to continuous)
Disturbance vector \(d_t\) (uncontrollable external inputs):
- Outdoor dry-bulb temperature (from weather data)
- IT workload / CPU utilization (from workload traces)
- Grid carbon intensity (if included in the cost function)
Linear dynamics model:
\[\underbrace{\begin{bmatrix} T_{zone} \\ T_{CRAH} \end{bmatrix}}_{x_{t+1}} = \underbrace{A}_{2\times2} \underbrace{\begin{bmatrix} T_{zone} \\ T_{CRAH} \end{bmatrix}}_{x_t} + \underbrace{B_u}_{2\times1} \underbrace{[\Delta T_{setpoint}]}_{u_t} + \underbrace{B_d}_{2\times3} \underbrace{\begin{bmatrix} T_{outdoor} \\ W_{load} \\ CI \end{bmatrix}}_{d_t}\]
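The one-step prediction of this model is easy to compute directly; below is a small sketch with illustrative, untrained matrices (the values carry no physical meaning and exist only to show the shapes):

```python
import numpy as np

# Illustrative, untrained dynamics parameters (shapes match the equation above).
A   = np.array([[0.9, 0.1],
                [0.0, 0.8]])         # 2x2 state transition
B_u = np.array([[0.05],
                [0.30]])             # 2x1 effect of the setpoint adjustment
B_d = np.array([[0.02, 0.04, 0.0],
                [0.01, 0.00, 0.0]])  # 2x3 effect of disturbances

x_t = np.array([24.0, 18.0])         # [T_zone, T_CRAH] in degrees C
u_t = np.array([-1.0])               # lower the setpoint by 1 degree C
d_t = np.array([30.0, 0.6, 350.0])   # [T_outdoor, workload fraction, CI]

x_next = A @ x_t + B_u @ u_t + B_d @ d_t  # predicted [T_zone, T_CRAH]
```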
14.7.4 Two-Phase Training
Gnu-RL trains in two phases:
Phase 1: Imitation Learning (offline)
- Run the baseline controller for 3+ months of simulated time to collect state-action pairs \((x_t, u_t, d_t)\)
- Initialize \(A\), \(B_u\), \(B_d\) randomly
- Minimize a combined loss:
\[\mathcal{L} = \lambda \sum_t \|x_{t+1} - \hat{x}_{t+1}\|^2 + (1-\lambda) \sum_t \|u_t - \hat{u}_t\|^2\]
where \(\hat{x}_{t+1}\) and \(\hat{u}_t\) are the model’s predictions, and \(\lambda\) balances state prediction accuracy vs. action matching.
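In code, the Phase 1 loss over a batch of logged transitions can be sketched as follows (NumPy form; \(\hat{x}_{t+1}\) and \(\hat{u}_t\) are assumed to come from rolling the current dynamics model and MPC policy over the logged data):

```python
import numpy as np

def imitation_loss(x_next, x_hat, u, u_hat, lam=0.5):
    """Sketch of the Phase 1 loss: state-prediction error plus action matching.

    x_next, x_hat: logged vs. predicted next states.
    u, u_hat: logged (baseline controller) vs. predicted (MPC) actions.
    lam trades off the two terms; 0.5 is an arbitrary illustrative value.
    """
    state_loss = np.sum((x_next - x_hat) ** 2)
    action_loss = np.sum((u - u_hat) ** 2)
    return lam * state_loss + (1 - lam) * action_loss
```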
Phase 2: Online Learning (policy gradient refinement)
- Deploy the pre-trained agent in the SustainDC environment
- Continue training with PPO, backpropagating through the MPC solver
- The agent fine-tunes \(A\), \(B_u\), \(B_d\) to improve actual performance (not just imitation)
The original Gnu-RL was designed for building HVAC (slow dynamics, 5–15 min timesteps, 3-hour planning horizon). Data center thermal dynamics are faster (1–5 min timesteps) because server rooms have less thermal mass than buildings. You may need to adjust the planning horizon and timestep accordingly. Additionally, data centers have better instrumentation than typical buildings, so the state observations tend to be more reliable.