15 Hands-On: Getting Started with PyDCM
15.1 Overview
The code in this notebook requires the dc-rl repository and its dependencies. It is not executed during the book build. Follow along by running each cell in your own Python environment.
The original authors provide a Colab notebook — this tutorial is a cleaned-up and expanded version of that notebook.
15.2 Environment Setup
15.2.1 Cloning the Repository
The PyDCM code lives on the pydcm branch of the dc-rl repository. The main branch contains the full SustainDC multi-agent framework (NeurIPS 2024); the pydcm branch has the original BuildSys 2023 implementation along with the DCRL wrapper that we will use here.
git clone -b pydcm https://github.com/HewlettPackard/dc-rl.git
cd dc-rl

15.2.2 Installing Dependencies
Initialize a uv project in the cloned repository and install the dependencies:
uv init --python 3.10

Before installing packages, edit pyproject.toml to restrict Python to 3.10 only. Change the line requires-python = ">=3.10" to:
requires-python = ">=3.10, <3.11"

This is necessary because Ray 2.4.0 has conflicting dependency versions across Python 3.10 vs. 3.11+, and uv resolves for all supported versions by default. Now install:
uv add numpy==1.23.5 pandas==1.5.3 scipy==1.13.0 "ray[rllib]==2.4.0" torch==2.0.0 "setuptools<72" xlsxwriter "opyplus==1.4.2" tensorflow-probability

Ray 2.4.0 will pull in gymnasium==0.26.3 automatically as a dependency. You also need setuptools<72 because Ray 2.4.0 uses pkg_resources, which was removed from setuptools in version 72.
If uv sync fails with a grpcio build error (ModuleNotFoundError: No module named 'pkg_resources'), add this to your pyproject.toml and re-run:
[tool.uv]
override-dependencies = ["grpcio>=1.60.0"]

This forces uv to use a newer grpcio that has pre-built Apple Silicon wheels, bypassing the source build of the older version Ray 2.4.0 would otherwise pull in.
The DCRL wrapper (dcrl_env.py) inherits from Ray’s MultiAgentEnv at the module level, so from dcrl_env import DCRL will fail with a ModuleNotFoundError if Ray is not installed. Ray and PyTorch are heavyweight packages (~2 GB together), but they are not optional for this codebase.
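Before importing the wrapper, it can be worth confirming that the heavyweight dependencies resolved correctly. A minimal check, run inside the uv environment created above:

# Confirm the pinned heavyweight dependencies import cleanly; dcrl_env
# cannot be imported until these succeed.
import ray, torch, gymnasium
print(ray.__version__, torch.__version__, gymnasium.__version__)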
If you are running in Google Colab (which does not have uv), you can install packages directly with pip:
pip install numpy==1.23.5 pandas==1.5.3 scipy==1.13.0 "ray[rllib]==2.4.0" torch==2.0.0 xlsxwriter "opyplus==1.4.2" tensorflow-probability

15.2.3 Verifying the Setup
A quick sanity check — import the key modules and confirm the environment registers:

from dcrl_env import DCRL
import numpy as np

env = DCRL({"agents": ["agent_dc"]})
obs, info = env.reset()

print("Observation space:", env.observation_space["agent_dc"])
print("Action space: ", env.action_space["agent_dc"])

You should see output like:
Observation space: Box(-2000000000.0, 5000000000.0, (12,), float32)
Action space: Discrete(9)
The observation is a 12-dimensional vector (normalized state variables), and the agent chooses from 9 discrete actions (combinations of setpoint adjustments).
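If you want to poke at the spaces a bit further, the standard Gymnasium space methods work on the per-agent entries. A small sketch, reusing the env object from the verification snippet:

# Inspect the per-agent spaces directly.
print(env.observation_space["agent_dc"].shape)   # (12,)
print(env.action_space["agent_dc"].n)            # 9
print(env.action_space["agent_dc"].sample())     # a random valid action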
15.3 Understanding the Configuration
Before running any simulation, it is worth understanding what physical system you are modeling. The data center layout and equipment parameters are defined in utils/dc_config.json.
15.3.1 Key Configuration Parameters
{
"data_center_configuration": {
"NUM_ROWS": 2,
"NUM_RACKS_PER_ROW": 1,
"CPUS_PER_RACK": 150,
"RACK_SUPPLY_APPROACH_TEMP_LIST": [5.3, 5.3],
"RACK_RETURN_APPROACH_TEMP_LIST": [-3.7, -3.7]
},
"hvac_configuration": {
"C_AIR": 1006,
"RHO_AIR": 1.225,
"CHILLER_COP": 6.0,
"CRAC_SUPPLY_AIR_FLOW_RATE_pu": 0.00005663,
"CT_FAN_REF_P": 1000
},
"server_characteristics": {
"CPU_POWER_RATIO_LB": [0.22, 1.00],
"CPU_POWER_RATIO_UB": [0.24, 1.02],
"IT_FAN_AIRFLOW_RATIO_LB": [0.0, 0.6],
"IT_FAN_AIRFLOW_RATIO_UB": [0.7, 1.3],
"INLET_TEMP_RANGE": [18, 27],
"DEFAULT_SERVER_POWER_CHARACTERISTICS": [[170, 110], [120, 60]],
"HP_PROLIANT": [110, 170]
}
}

15.3.2 What Each Section Controls
| Section | Parameter | Physical Meaning |
|---|---|---|
| data_center_configuration | NUM_ROWS, NUM_RACKS_PER_ROW | IT room geometry: how many rack rows and racks per row |
| | CPUS_PER_RACK | Number of servers in each rack |
| | RACK_SUPPLY_APPROACH_TEMP_LIST | CFD-derived offsets (\(\Delta T\)) between CRAH supply air and actual rack inlet (one per rack) |
| hvac_configuration | C_AIR, RHO_AIR | Thermophysical properties of air (\(c_p = 1006\) J/kg\(\cdot\)K, \(\rho = 1.225\) kg/m\(^3\)) |
| | CHILLER_COP | Chiller coefficient of performance (6.0 = 6 kW cooling per 1 kW electricity) |
| | CRAC_SUPPLY_AIR_FLOW_RATE_pu | Volumetric airflow through the CRAH unit |
| server_characteristics | CPU_POWER_RATIO_LB/UB | Linear power curve coefficients: how CPU power scales with temperature and load |
| | IT_FAN_AIRFLOW_RATIO_LB/UB | Fan speed curve: how server fan airflow scales with temperature |
| | INLET_TEMP_RANGE | ASHRAE-compliant operating range for server inlet temperatures (18–27\(°\)C) |
| | DEFAULT_SERVER_POWER_CHARACTERISTICS | [full_load_power, idle_power] pairs for server types (Watts) |
Recall from our discussion: each rack’s CPUS_PER_RACK servers have their power curve parameters (CPU_POWER_RATIO_LB/UB) collected into NumPy arrays of shape (num_CPUs,). The simulation then computes power for all servers in a rack simultaneously via NumPy broadcasting, rather than looping over individual server objects. The RACK_SUPPLY_APPROACH_TEMP_LIST entries determine each rack’s inlet temperature offset from the CRAH supply.
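As a toy illustration of that broadcasting idea, a whole rack can be evaluated in one vectorized expression. This is not PyDCM's actual power equation, only the shape of the computation, with an illustrative linear idle-to-full-load model:

import numpy as np

# Per-server parameters live in arrays of shape (num_CPUs,), so a single
# expression covers the whole rack (illustrative power model only).
rng = np.random.default_rng(0)
num_cpus = 150                                   # CPUS_PER_RACK
idle_w, full_load_w = 110.0, 170.0               # HP_PROLIANT-style pair, Watts
utilization = rng.uniform(0.0, 1.0, num_cpus)    # hypothetical per-server load

rack_power_w = idle_w + (full_load_w - idle_w) * utilization   # shape (num_cpus,)
print(rack_power_w.shape, f"{rack_power_w.sum() / 1000:.1f} kW total")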
15.4 Running a Simulation
15.4.1 The Gymnasium Loop
The DCRL wrapper exposes the standard Gymnasium API. Here is the minimal simulation loop:
from dcrl_env import DCRL
import numpy as np
# Configure environment: only the cooling agent is active
env_config = {
"agents": ["agent_dc"],
"location": "ny",
"cintensity_file": "NYIS_NG_&_avgCI.csv",
"weather_file": "USA_NY_New.York-Kennedy.epw",
"workload_file": "Alibaba_CPU_Data_Hourly_1.csv",
"max_bat_cap_Mw": 2,
"individual_reward_weight": 0.8,
"flexible_load": 0.1,
"ls_reward": "default_ls_reward",
"dc_reward": "default_dc_reward",
"bat_reward": "default_bat_reward",
}
env = DCRL(env_config)
obs, info = env.reset()
# Run one full episode (default: about one month at 15-min timesteps, ≈ 2,977 steps)
done = False
total_reward = 0.0
step_count = 0
while not done:
# Fixed action: "maintain current setpoint"
action = {"agent_dc": 4}
obs, reward, terminated, truncated, info = env.step(action)
total_reward += reward["agent_dc"]
step_count += 1
done = terminated.get("__all__", False)
print(f"Episode finished after {step_count} steps")
print(f"Total reward: {total_reward:.2f}")15.4.2 Understanding the Environment Config
A few things to note about env_config:
- agents: Setting ["agent_dc"] means only the cooling agent is learning; the workload and battery agents use default baselines.
- location: Determines which weather and carbon intensity data to load. Options include "ny", "az", "wa".
- individual_reward_weight: The \(\alpha\) parameter for collaborative reward sharing. At 0.8, the cooling agent receives 80% of its own reward and 10% from each of the other two agents (see the arithmetic sketch below).
- flexible_load: Fraction of workload that is delay-tolerant (10% here).
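The sharing arithmetic behind individual_reward_weight is easy to sanity-check by hand. The per-agent reward values below are made up, and the even split of the remaining weight between the other two agents is as described above:

# Illustrative reward sharing for individual_reward_weight = 0.8,
# with hypothetical per-agent rewards.
alpha = 0.8
r_dc, r_ls, r_bat = -1.0, -0.5, -0.2
shared_dc = alpha * r_dc + (1 - alpha) / 2 * (r_ls + r_bat)
print(shared_dc)   # 0.8*(-1.0) + 0.1*(-0.5) + 0.1*(-0.2) = -0.87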
15.5 Benchmarking PyDCM
The paper claims PyDCM is 30–40\(\times\) faster than EnergyPlus. Let’s time the three environment methods that matter for RL training throughput: init, reset, and step.
import time
from statistics import mean, stdev
from dcrl_env import DCRL
env_config = {
"agents": ["agent_dc"],
"location": "ny",
"cintensity_file": "NYIS_NG_&_avgCI.csv",
"weather_file": "USA_NY_New.York-Kennedy.epw",
"workload_file": "Alibaba_CPU_Data_Hourly_1.csv",
"max_bat_cap_Mw": 2,
"individual_reward_weight": 0.8,
"flexible_load": 0.1,
"ls_reward": "default_ls_reward",
"dc_reward": "default_dc_reward",
"bat_reward": "default_bat_reward",
}
N_RUNS = 10
N_STEPS = 1000
init_times = []
reset_times = []
step_times = []
for run in range(N_RUNS):
# Time: environment creation
t0 = time.perf_counter()
env = DCRL(env_config)
init_times.append(time.perf_counter() - t0)
# Time: reset
t0 = time.perf_counter()
obs, info = env.reset()
reset_times.append(time.perf_counter() - t0)
# Time: step (average over N_STEPS)
action = env.dc_env.action_space.sample()
t0 = time.perf_counter()
valid_steps = 0
while valid_steps < N_STEPS:
obs, rew, terminated, truncated, info = env.step({"agent_dc": action})
valid_steps += 1
if terminated.get("__all__", False) or truncated.get("__all__", False):
env.reset()
step_times.append((time.perf_counter() - t0) / N_STEPS)
print(f"init : {mean(init_times):.4f} ± {stdev(init_times):.4f} s")
print(f"reset : {mean(reset_times):.6f} ± {stdev(reset_times):.6f} s")
print(f"step : {mean(step_times):.6f} ± {stdev(step_times):.6f} s")15.5.1 Expected Results
For comparison, here are the numbers from the paper (Table 2) alongside what the Colab notebook produces:
| Method | EnergyPlus | PyDCM (paper) | PyDCM (Colab) |
|---|---|---|---|
| init | 1.05 s \(\pm\) 23.6 ms | 1.57 ms \(\pm\) 60.4 \(\mu\)s | ~0.20 s |
| reset | 2.67 s \(\pm\) 23.8 ms | 0.03 ms \(\pm\) 0.25 \(\mu\)s | ~0.013 s |
| step | 0.46 ms \(\pm\) 98 \(\mu\)s | 0.13 ms \(\pm\) 15.8 \(\mu\)s | ~1.18 ms |
The paper’s benchmarks were run on a 48-core Intel Xeon 6248 server. Colab provides a shared VM with fewer resources, so the absolute times are higher. The key takeaway is the relative speedup over EnergyPlus, not the absolute times. On the authors’ server, step was ~8,300 iterations/s; on Colab, ~850 iterations/s.
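To get a figure you can compare against those iterations/s numbers, convert your own measured mean step time into throughput. This assumes step_times from the benchmark script above is still in scope:

# Throughput implied by the measured mean step time.
steps_per_s = 1.0 / mean(step_times)
print(f"~{steps_per_s:,.0f} env steps per second")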
15.5.2 Episode-Level Timing
You can also measure cumulative simulation time for realistic episode lengths:
def run_episode_timing(num_steps, n_runs=10):
"""Time a full episode of num_steps steps, averaged over n_runs."""
episode_times = []
for _ in range(n_runs):
env = DCRL(env_config)
env.reset()
action = env.dc_env.action_space.sample()
t0 = time.perf_counter()
for _ in range(num_steps):
obs, rew, terminated, truncated, info = env.step({"agent_dc": action})
if terminated.get("__all__", False) or truncated.get("__all__", False):
break
episode_times.append(time.perf_counter() - t0)
return mean(episode_times), stdev(episode_times)
# 7 days at 15-min timesteps = 7 * 24 * 4 = 672 steps
mean_7d, std_7d = run_episode_timing(7 * 24 * 4)
# 30 days at 15-min timesteps = 30 * 24 * 4 = 2880 steps
mean_30d, std_30d = run_episode_timing(30 * 24 * 4)
print(f"7-day episode : {mean_7d:.3f} ± {std_7d:.3f} s")
print(f"30-day episode : {mean_30d:.3f} ± {std_30d:.3f} s")15.6 Evaluating a Baseline Controller
Before training an RL agent, you need a baseline to compare against. Here we run a simple fixed-action controller that always takes the same action (maintain current setpoint):
import numpy as np
from dcrl_env import DCRL
env_config = {
"agents": ["agent_dc"],
"location": "ny",
"cintensity_file": "NYIS_NG_&_avgCI.csv",
"weather_file": "USA_NY_New.York-Kennedy.epw",
"workload_file": "Alibaba_CPU_Data_Hourly_1.csv",
"max_bat_cap_Mw": 2,
"individual_reward_weight": 0.8,
"flexible_load": 0.1,
"ls_reward": "default_ls_reward",
"dc_reward": "default_dc_reward",
"bat_reward": "default_bat_reward",
"evaluation": True, # enables evaluation mode (used by the load-shifting agent)
}
env = DCRL(env_config)
obs, info = env.reset()
done = False
energy_trace = []
carbon_trace = []
baseline_action = 4 # "maintain current setpoint"
while not done:
obs, rew, terminated, truncated, info = env.step({"agent_dc": baseline_action})
dc_info = info["agent_dc"]
energy_trace.append(dc_info["bat_total_energy_with_battery_KWh"])
carbon_trace.append(dc_info["bat_CO2_footprint"])
done = terminated["__all__"]
print(f"Baseline final energy: {energy_trace[-1]:.2f} kWh")
print(f"Baseline final carbon: {carbon_trace[-1]:.2f} gCO2")info Dict Contains
The info["agent_dc"] dictionary returned at each step contains detailed metrics including total energy consumption (IT + HVAC + battery), carbon footprint, and individual component breakdowns. This is how you track KPIs without modifying the environment code.
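Because the exact key names can differ between dc-rl revisions, the safest way to see what is available is to print them after a step. A minimal sketch, reusing DCRL and env_config from the baseline script above:

# List everything the environment reports for the cooling agent each step.
env = DCRL(env_config)
env.reset()
obs, rew, terminated, truncated, info = env.step({"agent_dc": 4})
for key, value in sorted(info["agent_dc"].items()):
    print(f"{key:45s} {value}")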
15.7 Training a PPO Agent
The repository includes a training script train_ppo.py that uses Ray RLlib to train a PPO agent on the cooling control task.
15.7.1 Running Training
From the repository root:
uv run python train_ppo.py

This will train for the default number of iterations and save checkpoints to the pydcm/ directory. You can monitor training progress with TensorBoard:
uv run tensorboard --logdir pydcm/

15.7.2 What Is the Agent Learning?
During training, the PPO agent interacts with the PyDCM simulation thousands of times per iteration. At each timestep it:
- Observes the current state (ambient temperature, CRAH setpoint, zone temperature, HVAC power, IT power, etc.)
- Selects an action (decrease, maintain, or increase the CRAH supply temperature setpoint)
- Receives a reward equal to the negative carbon footprint: \(r_t = -(E_{hvac,t} + E_{it,t}) \times CI_t\)
The agent learns to adjust the cooling setpoint dynamically — cooling less aggressively when the grid is carbon-intensive (high \(CI_t\)) or the workload is light, and more aggressively when temperatures approach unsafe limits.
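A minimal sketch of that reward with made-up energy and carbon-intensity values (the actual default_dc_reward implementation in the repository may normalize or scale differently):

# r_t = -(E_hvac + E_it) * CI_t, with illustrative numbers.
def carbon_reward(hvac_energy_kwh, it_energy_kwh, carbon_intensity_g_per_kwh):
    return -(hvac_energy_kwh + it_energy_kwh) * carbon_intensity_g_per_kwh

print(carbon_reward(50.0, 200.0, 350.0))   # -(250 kWh * 350 gCO2/kWh) = -87500.0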
15.8 Comparing Trained Agent vs. Baseline
After training completes, evaluate the trained policy against the fixed-action baseline:
import numpy as np
from pathlib import Path
from ray.rllib.algorithms.algorithm import Algorithm
from dcrl_env import DCRL
# --- Load the trained agent ---
trial_dir = next(Path("pydcm/pydcm_hvac_ppo").glob("PPO_*"))
checkpoint_dir = sorted(trial_dir.glob("checkpoint_*"))[-1]
algo = Algorithm.from_checkpoint(str(checkpoint_dir))
# --- Environment config (evaluation mode) ---
env_config = {
"agents": ["agent_dc"],
"location": "ny",
"cintensity_file": "NYIS_NG_&_avgCI.csv",
"weather_file": "USA_NY_New.York-Kennedy.epw",
"workload_file": "Alibaba_CPU_Data_Hourly_1.csv",
"max_bat_cap_Mw": 2,
"individual_reward_weight": 0.8,
"flexible_load": 0.1,
"ls_reward": "default_ls_reward",
"dc_reward": "default_dc_reward",
"bat_reward": "default_bat_reward",
"evaluation": True,
}
def rollout(policy_fn):
"""Run one full episode using the given policy function."""
env = DCRL(env_config)
obs, info = env.reset()
done = False
rewards = []
final_energy, final_carbon = None, None
while not done:
action = policy_fn(obs, env)
obs, rew, terminated, truncated, info = env.step({"agent_dc": action})
rewards.append(rew["agent_dc"])
final_energy = info["agent_dc"]["bat_total_energy_with_battery_KWh"]
final_carbon = info["agent_dc"]["bat_CO2_footprint"]
done = terminated["__all__"]
return {
"total_reward": float(np.sum(rewards)),
"final_energy_kwh": float(final_energy),
"final_carbon": float(final_carbon),
}
# --- Define policies ---
def trained_policy(obs, env):
return algo.compute_single_action(
obs["agent_dc"], policy_id="agent_dc", explore=False
)
def baseline_policy(obs, env):
return 4 # fixed: maintain setpoint
# --- Run comparison ---
trained_results = rollout(trained_policy)
baseline_results = rollout(baseline_policy)
print(f"{'Metric':<25} {'Baseline':>12} {'Trained':>12} {'Reduction':>10}")
print("-" * 62)
for key, label in [("final_energy_kwh", "Energy (kWh)"),
("final_carbon", "Carbon (gCO2)")]:
b = baseline_results[key]
t = trained_results[key]
pct = 100 * (b - t) / b
print(f"{label:<25} {b:>12.1f} {t:>12.1f} {pct:>9.1f}%")15.8.1 Expected Results
The Colab notebook reports approximately:
| Metric | Baseline | Trained | Reduction |
|---|---|---|---|
| Energy (kWh) | 554.4 | 443.2 | ~20% |
| Carbon (gCO\(_2\)) | 315,128 | 252,067 | ~20% |
A ~20% reduction in both energy and carbon footprint from a relatively short PPO training run is a strong result, demonstrating that even simple RL agents can significantly outperform fixed-setpoint controllers.
The fixed-action baseline maintains a constant CRAH setpoint regardless of conditions. The RL agent learns to modulate the setpoint based on current weather, workload, and grid carbon intensity. For example, it may allow the data center to run slightly warmer during periods of high carbon intensity (reducing HVAC energy when the grid is dirty) and cool more aggressively when the grid is clean.
15.9 Customizing the Data Center
One of PyDCM’s main advantages is configurability. You can modify utils/dc_config.json to model different data center designs.
15.9.1 Example: Scaling Up
To model a larger data center (e.g., 4 rows \(\times\) 5 racks \(\times\) 200 CPUs per rack = 4,000 servers):
{
"data_center_configuration": {
"NUM_ROWS": 4,
"NUM_RACKS_PER_ROW": 5,
"CPUS_PER_RACK": 200,
"RACK_SUPPLY_APPROACH_TEMP_LIST": [
5.3, 5.3, 5.5, 5.5, 5.7,
5.3, 5.3, 5.5, 5.5, 5.7,
5.3, 5.3, 5.5, 5.5, 5.7,
5.3, 5.3, 5.5, 5.5, 5.7
]
}
}

Note that RACK_SUPPLY_APPROACH_TEMP_LIST must have exactly NUM_ROWS * NUM_RACKS_PER_ROW entries — one approach temperature offset per rack, derived from CFD analysis.
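A quick consistency check after editing the file can save a confusing failure later (illustrative only; PyDCM itself may not validate this for you):

import json

# Verify the approach-temperature list matches the rack count.
with open("utils/dc_config.json") as f:
    dc_cfg = json.load(f)["data_center_configuration"]

expected = dc_cfg["NUM_ROWS"] * dc_cfg["NUM_RACKS_PER_ROW"]
actual = len(dc_cfg["RACK_SUPPLY_APPROACH_TEMP_LIST"])
assert actual == expected, f"expected {expected} approach temperatures, got {actual}"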
15.9.2 Example: Different Server Types
To model heterogeneous servers with different power characteristics, modify the server power curves:
{
"server_characteristics": {
"DEFAULT_SERVER_POWER_CHARACTERISTICS": [
[300, 150],
[170, 110],
[120, 60]
]
}
}

Each [full_load_power, idle_power] pair (in Watts) defines a server type. PyDCM will distribute these across the racks.
A few experiments worth trying:

- Location comparison: Change location to "az" (hot climate, Arizona) vs. "wa" (mild climate, Washington) and compare baseline energy consumption. How does climate affect cooling costs?
- Chiller efficiency: Try CHILLER_COP values of 4.0, 6.0, and 8.0. How sensitive is the total energy consumption to chiller efficiency?
- Data center scale: Compare simulation speed for 300, 3,000, and 30,000 CPUs. Does it match the sub-linear scaling claim from Figure 3 in the paper?
15.10 Next Steps
This tutorial covers the pydcm branch — the original BuildSys 2023 implementation. For Assignment 3, you will use the main branch which contains the full SustainDC multi-agent framework and apply the Gnu-RL algorithm (Paper 3) to this environment. The concepts and API patterns are the same; the main difference is the multi-agent coordination layer described in the paper notes.