15 Hands-On: Getting Started with PyDCM
15.1 Overview
The code in this notebook requires the dc-rl repository and its dependencies. It is not executed during the book build. Follow along by running each cell in your own Python environment.
The original authors provide a Colab notebook — this tutorial is a cleaned-up and expanded version of that notebook.
15.2 Environment Setup
15.2.1 Cloning the Repository
The PyDCM code lives on the pydcm branch of the dc-rl repository. The main branch contains the full SustainDC multi-agent framework (NeurIPS 2024); the pydcm branch has the original BuildSys 2023 implementation along with the DCRL wrapper that we will use here.
git clone -b pydcm https://github.com/HewlettPackard/dc-rl.git
cd dc-rl

15.2.2 Installing Dependencies
Initialize a uv project in the cloned repository and install the dependencies:
uv init --python 3.10

Before installing packages, edit pyproject.toml to restrict Python to 3.10 only. Change the line requires-python = ">=3.10" to:
requires-python = ">=3.10, <3.11"

This is necessary because Ray 2.4.0 has conflicting dependency versions across Python 3.10 vs. 3.11+, and uv resolves for all supported versions by default. Now install:
uv add numpy==1.23.5 pandas==1.5.3 scipy==1.13.0 "ray[rllib]==2.4.0" torch==2.0.0 "setuptools<72" xlsxwriter "opyplus==1.4.2" tensorflow-probability

Ray 2.4.0 will pull in gymnasium==0.26.3 automatically as a dependency. You also need setuptools<72 because Ray 2.4.0 uses pkg_resources, which was removed from setuptools in version 72.
If uv sync fails with a grpcio build error (ModuleNotFoundError: No module named 'pkg_resources'), add this to your pyproject.toml and re-run:
[tool.uv]
override-dependencies = ["grpcio>=1.60.0"]

This forces uv to use a newer grpcio that has pre-built Apple Silicon wheels, bypassing the source build of the older version Ray 2.4.0 would otherwise pull in.
The DCRL wrapper (dcrl_env.py) inherits from Ray’s MultiAgentEnv at the module level, so from dcrl_env import DCRL will fail with a ModuleNotFoundError if Ray is not installed. Ray and PyTorch are heavyweight packages (~2 GB together), but they are not optional for this codebase.
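Before importing the wrapper, it can be worth confirming that the heavyweight dependencies resolved correctly. A minimal check, run inside the uv environment created above:

# Confirm the pinned heavyweight dependencies import cleanly; dcrl_env
# cannot be imported until these succeed.
import ray, torch, gymnasium
print(ray.__version__, torch.__version__, gymnasium.__version__)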
If you are running in Google Colab (which does not have uv), you can install packages directly with pip:
pip install numpy==1.23.5 pandas==1.5.3 scipy==1.13.0 "ray[rllib]==2.4.0" torch==2.0.0 xlsxwriter "opyplus==1.4.2" tensorflow-probability

15.2.3 Verifying the Setup
A quick sanity check — import the key modules and confirm the environment registers:

from dcrl_env import DCRL
import numpy as np

env = DCRL({"agents": ["agent_dc"]})
obs, info = env.reset()

print("Observation space:", env.observation_space["agent_dc"])
print("Action space: ", env.action_space["agent_dc"])

You should see output like:
Observation space: Box(-2000000000.0, 5000000000.0, (12,), float32)
Action space: Discrete(9)
The observation is a 12-dimensional vector (normalized state variables), and the agent chooses from 9 discrete actions (combinations of setpoint adjustments).
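If you want to poke at the spaces a bit further, the standard Gymnasium space methods work on the per-agent entries. A small sketch, reusing the env object from the verification snippet:

# Inspect the per-agent spaces directly.
print(env.observation_space["agent_dc"].shape)   # (12,)
print(env.action_space["agent_dc"].n)            # 9
print(env.action_space["agent_dc"].sample())     # a random valid action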
15.3 Understanding the Configuration
Before running any simulation, it is worth understanding what physical system you are modeling. The data center layout and equipment parameters are defined in utils/dc_config.json.
15.3.1 Key Configuration Parameters
{
"data_center_configuration": {
"NUM_ROWS": 2,
"NUM_RACKS_PER_ROW": 1,
"CPUS_PER_RACK": 150,
"RACK_SUPPLY_APPROACH_TEMP_LIST": [5.3, 5.3],
"RACK_RETURN_APPROACH_TEMP_LIST": [-3.7, -3.7]
},
"hvac_configuration": {
"C_AIR": 1006,
"RHO_AIR": 1.225,
"CHILLER_COP": 6.0,
"CRAC_SUPPLY_AIR_FLOW_RATE_pu": 0.00005663,
"CT_FAN_REF_P": 1000
},
"server_characteristics": {
"CPU_POWER_RATIO_LB": [0.22, 1.00],
"CPU_POWER_RATIO_UB": [0.24, 1.02],
"IT_FAN_AIRFLOW_RATIO_LB": [0.0, 0.6],
"IT_FAN_AIRFLOW_RATIO_UB": [0.7, 1.3],
"INLET_TEMP_RANGE": [18, 27],
"DEFAULT_SERVER_POWER_CHARACTERISTICS": [[170, 110], [120, 60]],
"HP_PROLIANT": [110, 170]
}
}

15.3.2 What Each Section Controls
| Section | Parameter | Physical Meaning |
|---|---|---|
| data_center_configuration | NUM_ROWS, NUM_RACKS_PER_ROW | IT room geometry: how many rack rows and racks per row |
| | CPUS_PER_RACK | Number of servers in each rack |
| | RACK_SUPPLY_APPROACH_TEMP_LIST | CFD-derived offsets (\(\Delta T\)) between CRAH supply air and actual rack inlet (one per rack) |
| hvac_configuration | C_AIR, RHO_AIR | Thermophysical properties of air (\(c_p = 1006\) J/kg\(\cdot\)K, \(\rho = 1.225\) kg/m\(^3\)) |
| | CHILLER_COP | Chiller coefficient of performance (6.0 = 6 kW cooling per 1 kW electricity) |
| | CRAC_SUPPLY_AIR_FLOW_RATE_pu | Volumetric airflow through the CRAH unit |
| server_characteristics | CPU_POWER_RATIO_LB/UB | Linear power curve coefficients: how CPU power scales with temperature and load |
| | IT_FAN_AIRFLOW_RATIO_LB/UB | Fan speed curve: how server fan airflow scales with temperature |
| | INLET_TEMP_RANGE | ASHRAE-compliant operating range for server inlet temperatures (18–27\(°\)C) |
| | DEFAULT_SERVER_POWER_CHARACTERISTICS | [full_load_power, idle_power] pairs for server types (Watts) |
Recall from our discussion: each rack’s CPUS_PER_RACK servers have their power curve parameters (CPU_POWER_RATIO_LB/UB) collected into NumPy arrays of shape (num_CPUs,). The simulation then computes power for all servers in a rack simultaneously via NumPy broadcasting, rather than looping over individual server objects. The RACK_SUPPLY_APPROACH_TEMP_LIST entries determine each rack’s inlet temperature offset from the CRAH supply.
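As a toy illustration of that broadcasting idea, a whole rack can be evaluated in one vectorized expression. This is not PyDCM's actual power equation, only the shape of the computation, with an illustrative linear idle-to-full-load model:

import numpy as np

# Per-server parameters live in arrays of shape (num_CPUs,), so a single
# expression covers the whole rack (illustrative power model only).
rng = np.random.default_rng(0)
num_cpus = 150                                   # CPUS_PER_RACK
idle_w, full_load_w = 110.0, 170.0               # HP_PROLIANT-style pair, Watts
utilization = rng.uniform(0.0, 1.0, num_cpus)    # hypothetical per-server load

rack_power_w = idle_w + (full_load_w - idle_w) * utilization   # shape (num_cpus,)
print(rack_power_w.shape, f"{rack_power_w.sum() / 1000:.1f} kW total")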
15.4 Running a Simulation
15.4.1 The Gymnasium Loop
The DCRL wrapper exposes the standard Gymnasium API. Here is the minimal simulation loop:
from dcrl_env import DCRL
import numpy as np
# Configure environment: only the cooling agent is active
env_config = {
"agents": ["agent_dc"],
"location": "ny",
"cintensity_file": "NYIS_NG_&_avgCI.csv",
"weather_file": "USA_NY_New.York-Kennedy.epw",
"workload_file": "Alibaba_CPU_Data_Hourly_1.csv",
"max_bat_cap_Mw": 2,
"individual_reward_weight": 0.8,
"flexible_load": 0.1,
"ls_reward": "default_ls_reward",
"dc_reward": "default_dc_reward",
"bat_reward": "default_bat_reward",
}
env = DCRL(env_config)
obs, info = env.reset()
# Run one full episode (default: about one month at 15-min timesteps, ≈ 2,977 steps)
done = False
total_reward = 0.0
step_count = 0
while not done:
# Fixed action: "maintain current setpoint"
action = {"agent_dc": 4}
obs, reward, terminated, truncated, info = env.step(action)
total_reward += reward["agent_dc"]
step_count += 1
done = terminated.get("__all__", False)
print(f"Episode finished after {step_count} steps")
print(f"Total reward: {total_reward:.2f}")15.4.2 Understanding the Environment Config
A few things to note about env_config:
- agents: Setting ["agent_dc"] means only the cooling agent is learning; the workload and battery agents use default baselines.
- location: Determines which weather and carbon intensity data to load. Options include "ny", "az", "wa".
- individual_reward_weight: The \(\alpha\) parameter for collaborative reward sharing. At 0.8, the cooling agent receives 80% of its own reward and 10% from each of the other two agents (see the arithmetic sketch below).
- flexible_load: Fraction of workload that is delay-tolerant (10% here).
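The sharing arithmetic behind individual_reward_weight is easy to sanity-check by hand. The per-agent reward values below are made up, and the even split of the remaining weight between the other two agents is as described above:

# Illustrative reward sharing for individual_reward_weight = 0.8,
# with hypothetical per-agent rewards.
alpha = 0.8
r_dc, r_ls, r_bat = -1.0, -0.5, -0.2
shared_dc = alpha * r_dc + (1 - alpha) / 2 * (r_ls + r_bat)
print(shared_dc)   # 0.8*(-1.0) + 0.1*(-0.5) + 0.1*(-0.2) = -0.87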
15.5 Benchmarking PyDCM
The paper claims PyDCM is 30–40\(\times\) faster than EnergyPlus. Let’s time the three environment methods that matter for RL training throughput: init, reset, and step.
import time
from statistics import mean, stdev
from dcrl_env import DCRL
env_config = {
"agents": ["agent_dc"],
"location": "ny",
"cintensity_file": "NYIS_NG_&_avgCI.csv",
"weather_file": "USA_NY_New.York-Kennedy.epw",
"workload_file": "Alibaba_CPU_Data_Hourly_1.csv",
"max_bat_cap_Mw": 2,
"individual_reward_weight": 0.8,
"flexible_load": 0.1,
"ls_reward": "default_ls_reward",
"dc_reward": "default_dc_reward",
"bat_reward": "default_bat_reward",
}
N_RUNS = 10
N_STEPS = 1000
init_times = []
reset_times = []
step_times = []
for run in range(N_RUNS):
# Time: environment creation
t0 = time.perf_counter()
env = DCRL(env_config)
init_times.append(time.perf_counter() - t0)
# Time: reset
t0 = time.perf_counter()
obs, info = env.reset()
reset_times.append(time.perf_counter() - t0)
# Time: step (average over N_STEPS)
action = env.dc_env.action_space.sample()
t0 = time.perf_counter()
valid_steps = 0
while valid_steps < N_STEPS:
obs, rew, terminated, truncated, info = env.step({"agent_dc": action})
valid_steps += 1
if terminated.get("__all__", False) or truncated.get("__all__", False):
env.reset()
step_times.append((time.perf_counter() - t0) / N_STEPS)
print(f"init : {mean(init_times):.4f} ± {stdev(init_times):.4f} s")
print(f"reset : {mean(reset_times):.6f} ± {stdev(reset_times):.6f} s")
print(f"step : {mean(step_times):.6f} ± {stdev(step_times):.6f} s")15.5.1 Expected Results
For comparison, here are the numbers from the paper (Table 2) alongside what the Colab notebook produces:
| Method | EnergyPlus | PyDCM (paper) | PyDCM (Colab) |
|---|---|---|---|
| init | 1.05 s \(\pm\) 23.6 ms | 1.57 ms \(\pm\) 60.4 \(\mu\)s | ~0.20 s |
| reset | 2.67 s \(\pm\) 23.8 ms | 0.03 ms \(\pm\) 0.25 \(\mu\)s | ~0.013 s |
| step | 0.46 ms \(\pm\) 98 \(\mu\)s | 0.13 ms \(\pm\) 15.8 \(\mu\)s | ~1.18 ms |
The paper’s benchmarks were run on a 48-core Intel Xeon 6248 server. Colab provides a shared VM with fewer resources, so the absolute times are higher. The key takeaway is the relative speedup over EnergyPlus, not the absolute times. On the authors’ server, step was ~8,300 iterations/s; on Colab, ~850 iterations/s.
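To get a figure you can compare against those iterations/s numbers, convert your own measured mean step time into throughput. This assumes step_times from the benchmark script above is still in scope:

# Throughput implied by the measured mean step time.
steps_per_s = 1.0 / mean(step_times)
print(f"~{steps_per_s:,.0f} env steps per second")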
15.5.2 Episode-Level Timing
You can also measure cumulative simulation time for realistic episode lengths:
def run_episode_timing(num_steps, n_runs=10):
"""Time a full episode of num_steps steps, averaged over n_runs."""
episode_times = []
for _ in range(n_runs):
env = DCRL(env_config)
env.reset()
action = env.dc_env.action_space.sample()
t0 = time.perf_counter()
for _ in range(num_steps):
obs, rew, terminated, truncated, info = env.step({"agent_dc": action})
if terminated.get("__all__", False) or truncated.get("__all__", False):
break
episode_times.append(time.perf_counter() - t0)
return mean(episode_times), stdev(episode_times)
# 7 days at 15-min timesteps = 7 * 24 * 4 = 672 steps
mean_7d, std_7d = run_episode_timing(7 * 24 * 4)
# 30 days at 15-min timesteps = 30 * 24 * 4 = 2880 steps
mean_30d, std_30d = run_episode_timing(30 * 24 * 4)
print(f"7-day episode : {mean_7d:.3f} ± {std_7d:.3f} s")
print(f"30-day episode : {mean_30d:.3f} ± {std_30d:.3f} s")15.6 Evaluating a Baseline Controller
Before training an RL agent, you need a baseline to compare against. Here we run a simple fixed-action controller that always takes the same action (maintain current setpoint):
import numpy as np
from dcrl_env import DCRL
env_config = {
"agents": ["agent_dc"],
"location": "ny",
"cintensity_file": "NYIS_NG_&_avgCI.csv",
"weather_file": "USA_NY_New.York-Kennedy.epw",
"workload_file": "Alibaba_CPU_Data_Hourly_1.csv",
"max_bat_cap_Mw": 2,
"individual_reward_weight": 0.8,
"flexible_load": 0.1,
"ls_reward": "default_ls_reward",
"dc_reward": "default_dc_reward",
"bat_reward": "default_bat_reward",
"evaluation": True, # enables evaluation mode (used by the load-shifting agent)
}
env = DCRL(env_config)
obs, info = env.reset()
done = False
energy_trace = []
carbon_trace = []
baseline_action = 4 # "maintain current setpoint"
while not done:
obs, rew, terminated, truncated, info = env.step({"agent_dc": baseline_action})
dc_info = info["agent_dc"]
energy_trace.append(dc_info["bat_total_energy_with_battery_KWh"])
carbon_trace.append(dc_info["bat_CO2_footprint"])
done = terminated["__all__"]
print(f"Baseline final energy: {energy_trace[-1]:.2f} kWh")
print(f"Baseline final carbon: {carbon_trace[-1]:.2f} gCO2")info Dict Contains
The info["agent_dc"] dictionary returned at each step contains detailed metrics including total energy consumption (IT + HVAC + battery), carbon footprint, and individual component breakdowns. This is how you track KPIs without modifying the environment code.
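Because the exact key names can differ between dc-rl revisions, the safest way to see what is available is to print them after a step. A minimal sketch, reusing DCRL and env_config from the baseline script above:

# List everything the environment reports for the cooling agent each step.
env = DCRL(env_config)
env.reset()
obs, rew, terminated, truncated, info = env.step({"agent_dc": 4})
for key, value in sorted(info["agent_dc"].items()):
    print(f"{key:45s} {value}")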
15.7 Training a PPO Agent
The repository includes a training script train_ppo.py that uses Ray RLlib to train a PPO agent on the cooling control task.
15.7.1 Running Training
From the repository root:
uv run python train_ppo.py

This will train for the default number of iterations and save checkpoints to the pydcm/ directory. You can monitor training progress with TensorBoard:
uv run tensorboard --logdir pydcm/

15.7.2 What Is the Agent Learning?
During training, the PPO agent interacts with the PyDCM simulation thousands of times per iteration. At each timestep it:
- Observes the current state (ambient temperature, CRAH setpoint, zone temperature, HVAC power, IT power, etc.)
- Selects an action (decrease, maintain, or increase the CRAH supply temperature setpoint)
- Receives a reward equal to the negative carbon footprint: \(r_t = -(E_{hvac,t} + E_{it,t}) \times CI_t\)
The agent learns to adjust the cooling setpoint dynamically — cooling less aggressively when the grid is carbon-intensive (high \(CI_t\)) or the workload is light, and more aggressively when temperatures approach unsafe limits.
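A minimal sketch of that reward with made-up energy and carbon-intensity values (the actual default_dc_reward implementation in the repository may normalize or scale differently):

# r_t = -(E_hvac + E_it) * CI_t, with illustrative numbers.
def carbon_reward(hvac_energy_kwh, it_energy_kwh, carbon_intensity_g_per_kwh):
    return -(hvac_energy_kwh + it_energy_kwh) * carbon_intensity_g_per_kwh

print(carbon_reward(50.0, 200.0, 350.0))   # -(250 kWh * 350 gCO2/kWh) = -87500.0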
15.8 Comparing Trained Agent vs. Baseline
After training completes, evaluate the trained policy against the fixed-action baseline:
import numpy as np
from pathlib import Path
from ray.rllib.algorithms.algorithm import Algorithm
from dcrl_env import DCRL
# --- Load the trained agent ---
trial_dir = next(Path("pydcm/pydcm_hvac_ppo").glob("PPO_*"))
checkpoint_dir = sorted(trial_dir.glob("checkpoint_*"))[-1]
algo = Algorithm.from_checkpoint(str(checkpoint_dir))
# --- Environment config (evaluation mode) ---
env_config = {
"agents": ["agent_dc"],
"location": "ny",
"cintensity_file": "NYIS_NG_&_avgCI.csv",
"weather_file": "USA_NY_New.York-Kennedy.epw",
"workload_file": "Alibaba_CPU_Data_Hourly_1.csv",
"max_bat_cap_Mw": 2,
"individual_reward_weight": 0.8,
"flexible_load": 0.1,
"ls_reward": "default_ls_reward",
"dc_reward": "default_dc_reward",
"bat_reward": "default_bat_reward",
"evaluation": True,
}
def rollout(policy_fn):
"""Run one full episode using the given policy function."""
env = DCRL(env_config)
obs, info = env.reset()
done = False
rewards = []
final_energy, final_carbon = None, None
while not done:
action = policy_fn(obs, env)
obs, rew, terminated, truncated, info = env.step({"agent_dc": action})
rewards.append(rew["agent_dc"])
final_energy = info["agent_dc"]["bat_total_energy_with_battery_KWh"]
final_carbon = info["agent_dc"]["bat_CO2_footprint"]
done = terminated["__all__"]
return {
"total_reward": float(np.sum(rewards)),
"final_energy_kwh": float(final_energy),
"final_carbon": float(final_carbon),
}
# --- Define policies ---
def trained_policy(obs, env):
return algo.compute_single_action(
obs["agent_dc"], policy_id="agent_dc", explore=False
)
def baseline_policy(obs, env):
return 4 # fixed: maintain setpoint
# --- Run comparison ---
trained_results = rollout(trained_policy)
baseline_results = rollout(baseline_policy)
print(f"{'Metric':<25} {'Baseline':>12} {'Trained':>12} {'Reduction':>10}")
print("-" * 62)
for key, label in [("final_energy_kwh", "Energy (kWh)"),
("final_carbon", "Carbon (gCO2)")]:
b = baseline_results[key]
t = trained_results[key]
pct = 100 * (b - t) / b
print(f"{label:<25} {b:>12.1f} {t:>12.1f} {pct:>9.1f}%")15.8.1 Expected Results
The Colab notebook reports approximately:
| Metric | Baseline | Trained | Reduction |
|---|---|---|---|
| Energy (kWh) | 554.4 | 443.2 | ~20% |
| Carbon (gCO\(_2\)) | 315,128 | 252,067 | ~20% |
A ~20% reduction in both energy and carbon footprint from a relatively short PPO training run is a strong result, demonstrating that even simple RL agents can significantly outperform fixed-setpoint controllers.
The fixed-action baseline maintains a constant CRAH setpoint regardless of conditions. The RL agent learns to modulate the setpoint based on current weather, workload, and grid carbon intensity. For example, it may allow the data center to run slightly warmer during periods of high carbon intensity (reducing HVAC energy when the grid is dirty) and cool more aggressively when the grid is clean.
15.9 Customizing the Data Center
One of PyDCM’s main advantages is configurability. You can modify utils/dc_config.json to model different data center designs.
15.9.1 Example: Scaling Up
To model a larger data center (e.g., 4 rows \(\times\) 5 racks \(\times\) 200 CPUs per rack = 4,000 servers):
{
"data_center_configuration": {
"NUM_ROWS": 4,
"NUM_RACKS_PER_ROW": 5,
"CPUS_PER_RACK": 200,
"RACK_SUPPLY_APPROACH_TEMP_LIST": [
5.3, 5.3, 5.5, 5.5, 5.7,
5.3, 5.3, 5.5, 5.5, 5.7,
5.3, 5.3, 5.5, 5.5, 5.7,
5.3, 5.3, 5.5, 5.5, 5.7
]
}
}

Note that RACK_SUPPLY_APPROACH_TEMP_LIST must have exactly NUM_ROWS * NUM_RACKS_PER_ROW entries — one approach temperature offset per rack, derived from CFD analysis.
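A quick consistency check after editing the file can save a confusing failure later (illustrative only; PyDCM itself may not validate this for you):

import json

# Verify the approach-temperature list matches the rack count.
with open("utils/dc_config.json") as f:
    dc_cfg = json.load(f)["data_center_configuration"]

expected = dc_cfg["NUM_ROWS"] * dc_cfg["NUM_RACKS_PER_ROW"]
actual = len(dc_cfg["RACK_SUPPLY_APPROACH_TEMP_LIST"])
assert actual == expected, f"expected {expected} approach temperatures, got {actual}"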
15.9.2 Example: Different Server Types
To model heterogeneous servers with different power characteristics, modify the server power curves:
{
"server_characteristics": {
"DEFAULT_SERVER_POWER_CHARACTERISTICS": [
[300, 150],
[170, 110],
[120, 60]
]
}
}

Each [full_load_power, idle_power] pair (in Watts) defines a server type. PyDCM will distribute these across the racks.
A few experiments worth trying:

- Location comparison: Change location to "az" (hot climate, Arizona) vs. "wa" (mild climate, Washington) and compare baseline energy consumption. How does climate affect cooling costs?
- Chiller efficiency: Try CHILLER_COP values of 4.0, 6.0, and 8.0. How sensitive is the total energy consumption to chiller efficiency?
- Data center scale: Compare simulation speed for 300, 3,000, and 30,000 CPUs. Does it match the sub-linear scaling claim from Figure 3 in the paper?
15.10 Next Steps
This tutorial covers the pydcm branch — the original BuildSys 2023 implementation. For Assignment 3, you will use the main branch which contains the full SustainDC multi-agent framework and apply the Gnu-RL algorithm (Paper 3) to this environment. The concepts and API patterns are the same; the main difference is the multi-agent coordination layer described in the paper notes.