14 PyDCM: Custom Data Center Models with Reinforcement Learning for Sustainability
14.1 Overview
14.2 Review of the paper
14.2.1 Summary
PyDCM (the primary paper) introduces a modular, Python-native data center simulation model that replaces the traditional reliance on EnergyPlus or Modelica for thermal modeling. Key advantages include:
- Speed: PyDCM is 30–40x faster than EnergyPlus for data center simulations thanks to vectorized thermal calculations and in-place operations, and its runtime scales sub-linearly with the number of CPUs.
- Customizability: Users can configure individual server specifications, cabinet arrangements (rows, racks, CPUs per rack), HVAC component parameters, and airflow containment strategies via a simple JSON configuration file (dc_config.json).
- RL integration: PyDCM wraps the simulation as an OpenAI Gymnasium environment, enabling direct use with standard RL libraries without cross-platform communication overhead.
- Reduced resource usage: Lower memory (16.84 GB vs. 18.20 GB) and CPU utilization (18.21% vs. 20.64%) compared to EnergyPlus.
SustainDC (the companion paper, NeurIPS 2024) builds on PyDCM to create a comprehensive multi-agent RL benchmarking platform for data center operations. It defines three interconnected Gymnasium environments:
- Workload Environment (\(Env_{LS}\)): Manages scheduling of delay-tolerant workloads using real traces from Alibaba and Google data centers.
- Data Center Environment (\(Env_{DC}\)): Simulates the thermo-fluid dynamics of the IT room and HVAC cooling system, with automatic chiller “sizing” based on workload and configuration.
- Battery Environment (\(Env_{BAT}\)): Models on-site battery charging/discharging to shift energy consumption away from high grid carbon intensity periods.
SustainDC benchmarks several MARL algorithms (IPPO, MAPPO, HAPPO, HAA2C, HAD3QN, HASAC) across multiple US locations, demonstrating that multi-agent approaches outperform single-agent baselines on carbon footprint, energy consumption, and water usage metrics.
14.2.2 What do we know already?
This paper connects to several concepts covered in earlier lectures:
From Lecture 2 (Why Buildings):
- The energy and carbon motivation: data centers consume massive amounts of energy (a moderately sized DC can use up to 100x the energy of a similarly sized office space), making them prime targets for energy efficiency improvements
- The Sense/Plan/Act framework—PyDCM/SustainDC provides the simulation environment needed to develop and test the “Plan” component (RL-based controllers)
From Lectures 3–5 (Thermal Dynamics of Buildings):
- The heat transfer and thermodynamic modeling concepts directly apply here: data center thermal models capture conduction, convection, and heat exchange between IT equipment, CRAH units, chillers, and cooling towers
- The concept of thermal zones maps to the IT room zones, cabinet-level temperature distributions, and supply/return air temperatures
- CFD-derived approach temperatures are used to simplify the complex 3D thermal dynamics into a computationally tractable model
From Lecture 6 (Thermal Comfort):
- While data centers don’t have human occupants to keep comfortable, the servers have strict thermal operating envelopes. The cooling setpoint optimization problem is analogous to comfort-constrained HVAC control—instead of PMV/PPD bounds, we have maximum allowable server inlet temperatures
From Lecture 7 (AC Power):
- Understanding of electrical power consumption models for IT equipment and HVAC components
- The grid carbon intensity concept—how the carbon content of electricity varies with time, location, and generation mix
From Lectures 8–9 (Control Theory):
- The fundamentals of feedback control apply directly to the CRAH setpoint control problem
- The paper compares RL-based controllers against rule-based baselines (e.g., ASHRAE Guideline 36), connecting to the classical control approaches covered in class
- The multi-agent formulation extends single-loop control to coordinated multi-objective optimization
To fully understand these papers’ contributions, students may need background on:
Data center architecture and cooling systems:
- IT room layout: rows, cabinets, servers, and airflow containment strategies (cold aisle, hot aisle, open)
- HVAC components specific to data centers: Computer Room Air Handlers (CRAH), chillers, pumps, cooling towers
- Heat rejection chain: server → CRAH → chiller → cooling tower → external environment
- Power Usage Effectiveness (PUE) and other data center efficiency metrics
Reinforcement learning fundamentals:
- Markov Decision Processes (MDPs): states, actions, rewards, transitions
- Policy gradient methods, specifically Proximal Policy Optimization (PPO)
- The OpenAI Gymnasium interface (reset, step, init) and how simulation environments are wrapped for RL
- Reward shaping and its impact on learned behavior
Multi-agent reinforcement learning (MARL):
- Independent vs. centralized training paradigms
- Independent PPO (IPPO) vs. Multi-Agent PPO (MAPPO) vs. heterogeneous methods (HAPPO, HAA2C, HAD3QN, HASAC)
- Collaborative reward structures and the \(\alpha\) weighting parameter for reward sharing
- Challenges of heterogeneous action and observation spaces
Grid carbon intensity and carbon-aware computing:
- How grid carbon intensity varies by location, time of day, and season
- Carbon-aware workload scheduling: shifting computation to low-CI periods
- Battery storage for energy arbitrage and carbon footprint reduction
Computational Fluid Dynamics (CFD) for data centers:
- How CFD simulations are used to precompute approach temperatures (difference between CRAH supply temperature and server inlet temperature)
- Simplifying 3D thermal dynamics into reduced-order models suitable for real-time control
14.3 Methods
The papers employ several technical methods that should be expanded upon in class discussion:
14.3.1 1. Data Center Thermal Modeling
- IT power model: Computes power consumption for each CPU and fan based on utilization and inlet temperature, with configurable power curves (idle power, rated full load power, rated full load frequency)
- HVAC model: Hierarchical model of CRAH, chiller, pump, and cooling tower, each with configurable parameters. Energy consumption is computed based on thermal load and component characteristics
- CFD-derived approach temperatures: Precomputed temperature offsets between CRAH supply air and actual server inlet temperatures, capturing the 3D airflow patterns without running CFD at every timestep
- Automatic chiller sizing: HVAC cooling capacities are automatically adjusted based on workload demands and IT room configurations
14.3.2 2. Reinforcement Learning Formulation
State spaces: Each agent observes different variables—e.g., the cooling agent sees time of day, dry-bulb temperature, room temperature, previous energy usage, and forecasted grid CI
Action spaces: All three agents use Discrete(3). The specific mappings are:

| Agent | Action 0 | Action 1 | Action 2 | Do-Nothing |
|---|---|---|---|---|
| \(Agent_{LS}\) | Defer shiftable tasks | Process normally | Process deferred queue | 1 |
| \(Agent_{DC}\) | Decrease CRAH setpoint | Maintain setpoint | Increase setpoint | 1 |
| \(Agent_{BAT}\) | Charge battery | Discharge battery | Idle (no charge/discharge) | 2 |

Note that the do-nothing action is 1 for the workload and cooling agents but 2 for the battery agent—easy to get wrong when writing baselines.
Reward design: Default reward is negative carbon footprint (\(CFP_t = (E_{hvac} + E_{it} + E_{bat}) \times CI_t\)), with customizable alternatives including energy consumption, operating costs, and water usage
Collaborative reward sharing: Weighted combination where each agent receives \(\alpha\) of its own reward plus \((1-\alpha)/2\) from each of the other two agents
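With three agents, the sharing rule works out as in the short sketch below (illustrative variable names and an assumed \(\alpha = 0.8\); this is not SustainDC's internal implementation):

```python
# Collaborative reward sharing: each agent keeps a fraction alpha of its own
# reward and receives (1 - alpha) / 2 of each of the other two agents' rewards.
# Variable names and alpha = 0.8 are illustrative assumptions.
def share_rewards(r_ls, r_dc, r_bat, alpha=0.8):
    shared_ls  = alpha * r_ls  + (1 - alpha) / 2 * (r_dc + r_bat)
    shared_dc  = alpha * r_dc  + (1 - alpha) / 2 * (r_ls + r_bat)
    shared_bat = alpha * r_bat + (1 - alpha) / 2 * (r_ls + r_dc)
    return shared_ls, shared_dc, shared_bat
```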
14.3.3 3. Multi-Agent Coordination
- Heterogeneous MARL: Different agents have different observation spaces, action spaces, and reward structures, requiring heterogeneous multi-agent methods
- Sequential environment stepping: At each timestep, the workload agent acts first (adjusting the compute load), then the DC cooling agent (setting the CRAH setpoint given the adjusted workload), then the battery agent (deciding charge/discharge given total energy consumption)
- Benchmarked algorithms: IPPO, MAPPO (centralized critic), HAPPO, HAA2C, HAD3QN, HASAC—spanning on-policy and off-policy, homogeneous and heterogeneous approaches
14.3.4 4. Simulation Performance Optimization
- Vectorized computations: NumPy-based vectorized thermal and power calculations instead of EnergyPlus’s sequential approach
- In-place operations: Minimizing memory allocation overhead during simulation steps
- Efficient reset: Fast environment reset for RL training loops (99.99% reduction in reset time vs. EnergyPlus)
- Sub-linear scaling: Simulation time scales sub-linearly with the number of CPUs, enabling hyper-scale DC modeling
14.4 Data Center Cooling Primer
Before diving into the codebase, it helps to understand the physical system being modeled.
14.4.1 The Heat Rejection Chain
A data center’s cooling system removes heat generated by IT equipment through a series of stages:
CPU/GPU → Server Fans → Rack (hot aisle) → CRAH Unit → Chilled Water Loop → Chiller → Cooling Tower → Atmosphere
Each stage has an associated energy cost:
- IT equipment generates heat proportional to its power consumption. A server drawing 300W converts nearly all of that to heat.
- Server fans push air through the rack, moving heat from the chip to the hot aisle. Fan power increases with temperature (more cooling needed → faster fans).
- Computer Room Air Handlers (CRAHs) draw hot air from the hot aisle, cool it via a chilled water coil, and supply cold air to the cold aisle. The CRAH fan consumes significant power.
- Chillers cool the water circulating through the CRAHs. Chiller efficiency is measured by the Coefficient of Performance (COP)—the ratio of cooling provided to electricity consumed. A COP of 5.0 means 5 kW of cooling per 1 kW of electricity.
- Cooling towers reject heat from the chiller condenser loop to the atmosphere via evaporative cooling, consuming water and pump energy.
14.4.2 Key Metrics
Power Usage Effectiveness (PUE) is the standard efficiency metric for data centers:
\[PUE = \frac{E_{total}}{E_{IT}} = \frac{E_{IT} + E_{cooling} + E_{other}}{E_{IT}}\]
A PUE of 1.0 is the theoretical ideal (all energy goes to computation). Typical values range from 1.2 (efficient) to 2.0+ (inefficient). The cooling system is the largest contributor to the gap between PUE and 1.0.
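For a concrete sense of the numbers (illustrative values, not taken from the papers): a facility drawing 1,000 kW of IT load, 350 kW of cooling, and 50 kW of other overhead has

\[PUE = \frac{1000 + 350 + 50}{1000} = 1.4\]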
ASHRAE Thermal Guidelines define allowable server inlet temperature ranges. The recommended range is 18–27°C, with an allowable range extending to 15–32°C for short periods. Operating at higher setpoints saves cooling energy but risks thermal throttling or hardware damage.
14.4.3 Approach Temperatures and CFD
In a real data center, the temperature at each server’s inlet differs from the CRAH supply temperature due to airflow mixing, recirculation, and containment effectiveness. The approach temperature captures this offset:
\[T_{inlet,rack} = T_{CRAH,supply} + \Delta T_{approach,rack}\]
PyDCM uses precomputed CFD results to set these approach temperatures per rack, avoiding the need to run expensive 3D fluid simulations at every timestep while still capturing spatial non-uniformity.
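A minimal sketch of how such a lookup might be applied each timestep (the array values and variable names are illustrative, not PyDCM's actual data or identifiers):

```python
import numpy as np

# CFD-derived approach temperatures, one per rack [°C] (illustrative values).
approach_temp = np.array([2.1, 3.4, 4.0, 2.8])

crah_supply_temp = 18.0  # current CRAH supply-air temperature [°C]

# Per-rack server inlet temperature: supply temperature plus the rack's offset.
rack_inlet_temp = crah_supply_temp + approach_temp
```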
14.5 Reinforcement Learning Foundations
14.5.1 Markov Decision Processes (MDPs)
An RL problem is formalized as an MDP defined by the tuple \((\mathcal{S}, \mathcal{A}, P, R, \gamma)\):
- \(\mathcal{S}\): State space — the set of all possible observations the agent can see
- \(\mathcal{A}\): Action space — the set of all actions the agent can take
- \(P(s'|s,a)\): Transition function — the probability of reaching state \(s'\) after taking action \(a\) in state \(s\)
- \(R(s,a)\): Reward function — the immediate reward for taking action \(a\) in state \(s\)
- \(\gamma \in [0,1]\): Discount factor — how much the agent values future vs. immediate rewards
The agent’s goal is to learn a policy \(\pi(a|s)\) that maximizes the expected cumulative discounted reward:
\[J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right]\]
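As a toy numerical example of the discounted return (values invented for illustration):

```python
# Discounted return over a short reward sequence with gamma = 0.99.
gamma = 0.99
rewards = [-1.0, -0.5, -0.8]  # e.g., negative carbon footprint per step
G = sum(gamma**t * r for t, r in enumerate(rewards))
# G = -1.0 + 0.99 * (-0.5) + 0.99**2 * (-0.8) ≈ -2.28
```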
14.5.2 Mapping MDPs to Data Center Cooling
In the SustainDC cooling environment (dc_gym.py), the MDP components are:
| MDP Component | Data Center Cooling |
|---|---|
| State \(s_t\) | Ambient temperature, CRAH setpoint, zone air temperature, HVAC power, IT power (5-dim vector, normalized) |
| Action \(a_t\) | Discrete: {decrease setpoint, maintain setpoint, increase setpoint} |
| Reward \(r_t\) | \(-CFP_t = -(E_{hvac} + E_{it}) \times CI_t\) (negative carbon footprint) |
| Transition | PyDCM thermal model steps forward one timestep |
| Episode | One year of operation (8,760 hourly steps) |
14.5.3 The OpenAI Gymnasium Interface
The individual sub-environments (dc_gym.py, ls_gym.py, bat_gym.py) follow the standard Gymnasium API:
```python
import gymnasium as gym

# Create the environment
env = gym.make("dc_gym-v0", config=config)

# Reset to initial state
obs, info = env.reset()

# Run one episode
done = False
while not done:
    action = agent.select_action(obs)  # your policy
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
```

The key methods:
- reset() → returns the initial observation and info dict. In PyDCM this is ~100x faster than EnergyPlus because there are no IDF files to recompile.
- step(action) → advances the simulation by one timestep and returns (observation, reward, terminated, truncated, info). PyDCM's vectorized computation makes this ~30x faster than EnergyPlus.
- observation_space and action_space → define the valid ranges for observations and actions.
The SustainDC multi-agent wrapper (sustaindc_env.py) was designed for the HARL training framework and deviates from the standard Gymnasium API in several ways:
- reset() returns only the observation dict—not the (obs, info) tuple you would expect. The info dict is stored internally on self.infos.
- observation_space and action_space are lists (one entry per active agent), not dicts keyed by agent name. The ordering follows the "agents" config list.
- step() returns the standard 5-tuple (obs, rewards, terminated, truncated, info), but all values are dicts keyed by agent name (e.g., "agent_dc").
- Episode end uses truncation, not termination. The wrapper sets truncateds["__all__"] = True but leaves terminateds["__all__"] = False. Your simulation loop must check both: done = terminated.get("__all__", False) or truncated.get("__all__", False). If you only check terminated, the loop will never exit and will eventually crash when data managers run past the end of their arrays. See the sketch after this list.
- The month parameter is required. It is not in EnvConfig's defaults, but if omitted, self.month is None and the environment crashes. The value is 0-indexed: 0 = January, 11 = December.
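Putting these deviations together, a minimal simulation loop looks roughly like the sketch below. It assumes the environment has already been constructed and that the three agent keys are "agent_ls", "agent_dc", and "agent_bat" (only "agent_dc" is confirmed above; check your "agents" config list):

```python
# Minimal SustainDC multi-agent loop (sketch). Agent keys other than
# "agent_dc" are assumptions; verify them against the "agents" config list.
obs = env.reset()  # the wrapper returns only the obs dict, no info tuple

done = False
while not done:
    # Hypothetical do-nothing baseline: action 1 for the workload and cooling
    # agents, action 2 for the battery agent (see the action table above).
    actions = {"agent_ls": 1, "agent_dc": 1, "agent_bat": 2}

    obs, rewards, terminated, truncated, info = env.step(actions)

    # Episode end is signaled via truncation, so check BOTH dicts.
    done = terminated.get("__all__", False) or truncated.get("__all__", False)
```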
See the SustainDC hands-on tutorial for working code examples.
14.5.4 Policy Gradient Methods and PPO
Policy gradient methods directly optimize the policy \(\pi_\theta(a|s)\) (parameterized by \(\theta\)) by computing gradients of the expected reward with respect to \(\theta\):
\[\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t|s_t) \cdot A_t\right]\]
where \(A_t\) is the advantage function — how much better action \(a_t\) is compared to the average action in state \(s_t\).
Proximal Policy Optimization (PPO) is the most widely used policy gradient algorithm. Its key idea is to prevent destructively large policy updates by clipping the objective:
\[L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta) A_t, \; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t\right)\right]\]
where \(r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\) is the probability ratio between the new and old policies, and \(\epsilon\) (typically 0.2) limits how far the new policy can deviate from the old one in a single update.
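Numerically, the clipped objective can be sketched as below (NumPy, batch form; real implementations such as those in the HARL library use PyTorch tensors and autograd, so treat this as a numerical illustration only):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Numerical sketch of PPO's clipped surrogate objective."""
    ratio = np.exp(logp_new - logp_old)                      # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    # PPO maximizes the mean of the element-wise minimum of the two terms.
    return np.mean(np.minimum(unclipped, clipped))
```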
PPO is the base algorithm for most of SustainDC’s multi-agent methods (IPPO, MAPPO, HAPPO all build on PPO).
14.5.5 Reward Shaping
The choice of reward function fundamentally determines what the agent learns. SustainDC provides several options via utils/reward_creator.py:
| Reward Function | What It Optimizes |
|---|---|
| default_dc_reward | Negative carbon footprint: \(-(E_{hvac} + E_{it}) \times CI_t\) |
| tou_reward | Time-of-use electricity cost |
| energy_PUE_reward | Power Usage Effectiveness |
| water_usage_efficiency_reward | Cooling tower water consumption |
| temperature_efficiency_reward | Thermal constraint satisfaction |
| custom_agent_reward | User-defined (template provided) |
Using carbon footprint as the reward means the agent will learn to cool less aggressively during high-CI periods (when the grid is dirty) and more aggressively during low-CI periods. This is desirable for sustainability but may conflict with thermal safety. In practice, you often need to combine multiple objectives or add constraint penalties.
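As an illustration of combining objectives, a custom reward could blend the carbon-footprint term with a penalty on thermal-limit violations. The sketch below only conveys the idea; the argument names are assumptions, and the actual signature expected by the custom_agent_reward template in utils/reward_creator.py may differ:

```python
def combined_dc_reward(e_hvac, e_it, ci, inlet_temp, t_max=27.0, penalty=10.0):
    """Illustrative combined reward: carbon footprint plus a thermal penalty.

    Argument names are assumptions for this sketch, not the actual
    custom_agent_reward signature in utils/reward_creator.py.
    """
    carbon = -(e_hvac + e_it) * ci              # negative carbon footprint
    overheat = max(0.0, inlet_temp - t_max)     # degrees C above the limit
    return carbon - penalty * overheat
```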
14.6 Hands-On Tutorials
The concepts covered above — data center thermal modeling, the Gymnasium interface, PPO, and multi-agent coordination — are explored in depth through two companion tutorials:
- Getting Started with PyDCM: Uses the pydcm branch (BuildSys 2023). Covers environment setup, the Gymnasium simulation loop, benchmarking PyDCM's speed advantage, and training a single-agent PPO controller for the cooling task.
- Getting Started with SustainDC: Uses the main branch (NeurIPS 2024). Covers the full multi-agent framework with three coordinated agents, rule-based baseline controllers, HARL training (HAPPO, MAPPO, etc.), and evaluation using the five SustainDC metrics.
Start with the PyDCM tutorial (simpler, single-agent) before moving to SustainDC (multi-agent with more configuration options).
14.7 Connecting to Gnu-RL (Paper 3 Preview)
Assignment 3 asks you to apply the Gnu-RL algorithm (Paper 3) to the SustainDC environment. Gnu-RL uses a fundamentally different policy architecture from the standard neural network policies in MAPPO/HAPPO.
14.7.1 The Differentiable MPC Policy
Instead of a neural network mapping observations to actions, Gnu-RL uses a Differentiable Model Predictive Control (MPC) layer as the policy. At each timestep, the policy solves an optimization problem:
\[\min_{u_t, \ldots, u_{t+T-1}} \sum_{k=t}^{t+T-1} \left[ \frac{\eta}{2} \|x_k - x_{setpoint}\|^2 + \|u_k\| \right]\]
subject to:
\[x_{k+1} = A x_k + B_u u_k + B_d d_k, \quad \underline{u} \leq u_k \leq \overline{u}\]
where:
- \(x_k\): state (zone temperatures)
- \(u_k\): control action (cooling setpoint adjustment)
- \(d_k\): disturbances (weather, workload)
- \(T\): planning horizon (e.g., 3 hours ahead)
- \(\eta\): weight balancing comfort vs. energy
- \(A, B_u, B_d\): learnable system dynamics parameters
The key insight is that this optimization problem is differentiable with respect to \(A\), \(B_u\), \(B_d\), and \(\eta\). This means we can backpropagate through the MPC solver to learn the dynamics model end-to-end.
14.7.2 Why This Matters
Compared to standard neural network policies (PPO, SAC, etc.), the Differentiable MPC policy differs in several ways:
| Property | Neural Network Policy | Differentiable MPC Policy |
|---|---|---|
| Parameters | Thousands–millions of weights | Only \(A\), \(B_u\), \(B_d\) (a few dozen) |
| Sample efficiency | Needs many episodes | Learns from limited data |
| Interpretability | Black box | Learned dynamics are inspectable |
| Domain knowledge | None encoded | Planning horizon, constraints, cost structure |
| Pre-training | Random initialization | Imitation learning from baseline controller |
14.7.3 Adapting for Data Center Cooling
To apply Gnu-RL to the SustainDC DC cooling environment, you need to define:
State vector \(x_t\) (what temperatures to track):
- Zone air temperature (IT room average)
- CRAH supply air temperature
Control action \(u_t\):
- CRAH setpoint adjustment (maps to the discrete actions in dc_gym.py, or can be relaxed to continuous)
Disturbance vector \(d_t\) (uncontrollable external inputs):
- Outdoor dry-bulb temperature (from weather data)
- IT workload / CPU utilization (from workload traces)
- Grid carbon intensity (if included in the cost function)
Linear dynamics model:
\[\underbrace{\begin{bmatrix} T_{zone} \\ T_{CRAH} \end{bmatrix}}_{x_{t+1}} = \underbrace{A}_{2\times2} \underbrace{\begin{bmatrix} T_{zone} \\ T_{CRAH} \end{bmatrix}}_{x_t} + \underbrace{B_u}_{2\times1} \underbrace{[\Delta T_{setpoint}]}_{u_t} + \underbrace{B_d}_{2\times3} \underbrace{\begin{bmatrix} T_{outdoor} \\ W_{load} \\ CI \end{bmatrix}}_{d_t}\]
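The one-step prediction of this model is easy to compute directly; below is a small sketch with illustrative, untrained matrices (the values carry no physical meaning and exist only to show the shapes):

```python
import numpy as np

# Illustrative, untrained dynamics parameters (shapes match the equation above).
A   = np.array([[0.9, 0.1],
                [0.0, 0.8]])         # 2x2 state transition
B_u = np.array([[0.05],
                [0.30]])             # 2x1 effect of the setpoint adjustment
B_d = np.array([[0.02, 0.04, 0.0],
                [0.01, 0.00, 0.0]])  # 2x3 effect of disturbances

x_t = np.array([24.0, 18.0])         # [T_zone, T_CRAH] in degrees C
u_t = np.array([-1.0])               # lower the setpoint by 1 degree C
d_t = np.array([30.0, 0.6, 350.0])   # [T_outdoor, workload fraction, CI]

x_next = A @ x_t + B_u @ u_t + B_d @ d_t  # predicted [T_zone, T_CRAH]
```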
14.7.4 Two-Phase Training
Gnu-RL trains in two phases:
Phase 1: Imitation Learning (offline)
- Run the baseline controller for 3+ months of simulated time to collect state-action pairs \((x_t, u_t, d_t)\)
- Initialize \(A\), \(B_u\), \(B_d\) randomly
- Minimize a combined loss:
\[\mathcal{L} = \lambda \sum_t \|x_{t+1} - \hat{x}_{t+1}\|^2 + (1-\lambda) \sum_t \|u_t - \hat{u}_t\|^2\]
where \(\hat{x}_{t+1}\) and \(\hat{u}_t\) are the model’s predictions, and \(\lambda\) balances state prediction accuracy vs. action matching.
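In code, the Phase 1 loss over a batch of logged transitions can be sketched as follows (NumPy form; \(\hat{x}_{t+1}\) and \(\hat{u}_t\) are assumed to come from rolling the current dynamics model and MPC policy over the logged data):

```python
import numpy as np

def imitation_loss(x_next, x_hat, u, u_hat, lam=0.5):
    """Sketch of the Phase 1 loss: state-prediction error plus action matching.

    x_next, x_hat: logged vs. predicted next states.
    u, u_hat: logged (baseline controller) vs. predicted (MPC) actions.
    lam trades off the two terms; 0.5 is an arbitrary illustrative value.
    """
    state_loss = np.sum((x_next - x_hat) ** 2)
    action_loss = np.sum((u - u_hat) ** 2)
    return lam * state_loss + (1 - lam) * action_loss
```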
Phase 2: Online Learning (policy gradient refinement)
- Deploy the pre-trained agent in the SustainDC environment
- Continue training with PPO, backpropagating through the MPC solver
- The agent fine-tunes \(A\), \(B_u\), \(B_d\) to improve actual performance (not just imitation)
The original Gnu-RL was designed for building HVAC (slow dynamics, 5–15 min timesteps, 3-hour planning horizon). Data center thermal dynamics are faster (1–5 min timesteps) because server rooms have less thermal mass than buildings. You may need to adjust the planning horizon and timestep accordingly. Additionally, data centers have better instrumentation than typical buildings, so the state observations tend to be more reliable.