16 Hands-On: Getting Started with SustainDC
16.1 Overview
The code in this notebook requires the dc-rl repository (main branch) and its dependencies. It is not executed during the book build. Follow along by running each cell in your own Python environment.
16.2 Environment Setup
16.2.1 Cloning the Repository
SustainDC lives on the main branch of the same repository as PyDCM:
git clone https://github.com/HewlettPackard/dc-rl.git sustaindc
cd sustaindc

The pydcm branch (BuildSys 2023) has a simpler single-agent wrapper. The main branch (NeurIPS 2024) adds the multi-agent orchestration layer (sustaindc_env.py), the HARL training framework, rule-based baseline agents, and expanded data for 8 US locations.
16.2.2 Installing Dependencies
Initialize a uv project in the cloned repository and install the core simulation dependencies:
uv init --python 3.10
uv add numpy==1.23.5 pandas==1.5.3 gymnasium==0.29.1 scipy==1.13.0 PyYAML==6.0.1 PsychroLib==2.5.0 "setuptools<72" matplotlib dash dash_bootstrap_components

This gives you everything needed to run simulations. For training with the HARL framework, also add:
uv add torch==2.0.0 tensorboard==2.12.3 tensorboardX==2.6.2.2 setproctitle "supersuit==3.7.0" "pettingzoo==1.22.2"

Why not uv add -r requirements.txt?
The full requirements.txt pulls in PyTorch, TensorFlow, Ray, JAX, and many other heavy libraries — some of which may conflict or fail to install on your platform. Installing the core dependencies explicitly gives you a working simulation environment without the headaches. Add training dependencies only when you need them.
16.3 The SustainDC Architecture
Before writing any code, it helps to understand the key architectural difference from PyDCM: SustainDC orchestrates three interconnected Gymnasium environments through a single wrapper class.
16.3.1 Sequential Execution Within Each Timestep
At every timestep, the three agents act sequentially, with information flowing downstream:
1. Agent_LS (Workload Scheduler)
Observes: time of day, carbon intensity forecast, pending workload
Acts: store delayable tasks / compute all / maximize throughput
↓ adjusted workload (\(\hat{B}_t\))
2. Agent_DC (Cooling Optimizer)
Observes: ambient temp, room temp, HVAC power, IT power, CI forecast
Acts: decrease / maintain / increase CRAH setpoint
↓ total energy consumption (\(E_{hvac} + E_{it}\))
3. Agent_BAT (Battery Manager)
Observes: battery SoC, DC energy consumption, CI forecast
Acts: charge / hold / discharge
This ordering matters: the workload agent’s decision changes how much heat the servers generate, which affects the cooling agent’s problem, which in turn determines the total energy that the battery agent must manage.
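To make the data flow concrete, here is a toy, fully self-contained sketch of one timestep. Every constant and formula in it is invented purely for illustration; the real thermal, workload, and battery models live in the three sub-environments.

# Toy illustration of the sequential dependency between the three agents.
# All constants and formulas below are made up for the example; the real
# physics lives in the SustainDC sub-environments.
def toy_step(actions, workload, outside_temp_c, carbon_intensity):
    # 1. Workload agent: defer some load, process normally, or work off the queue
    load_factor = {0: 0.9, 1: 1.0, 2: 1.1}[actions["agent_ls"]]
    adjusted_workload = workload * load_factor

    # 2. Cooling agent: IT heat follows the adjusted workload; HVAC power
    #    follows the IT heat, the weather, and the setpoint move
    it_power_kw = 200.0 + 800.0 * adjusted_workload
    setpoint_delta = {0: -1.0, 1: 0.0, 2: 1.0}[actions["agent_dc"]]
    hvac_power_kw = 0.3 * it_power_kw * (1 + 0.02 * (outside_temp_c - 20.0)) - 10.0 * setpoint_delta

    # 3. Battery agent: charging adds to grid demand, discharging offsets it
    battery_kw = {0: 50.0, 1: -50.0, 2: 0.0}[actions["agent_bat"]]
    grid_kw = it_power_kw + hvac_power_kw + battery_kw

    # Carbon for this step = grid power draw x carbon intensity
    return grid_kw * carbon_intensity

do_nothing = {"agent_ls": 1, "agent_dc": 1, "agent_bat": 2}
print(toy_step(do_nothing, workload=0.6, outside_temp_c=25.0, carbon_intensity=350.0))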
16.3.2 The SustainDC Wrapper
The SustainDC class in sustaindc_env.py handles all of this coordination. It:
- Creates the three sub-environments via factory functions
- Manages four data managers (Time_Manager, Workload_Manager, Weather_Manager, CI_Manager)
- Routes information between sub-environments at each step
- Constructs per-agent observation vectors (including CI trend features like slopes and percentiles)
- Computes collaborative rewards based on the \(\alpha\) weighting parameter
16.4 Verifying the Setup
Let’s confirm everything works by creating the environment and inspecting its structure:
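from sustaindc_env import SustainDC, EnvConfig

# Default config: all three agents active, New York location
env_config = EnvConfig({
    "agents": ["agent_ls", "agent_dc", "agent_bat"],
    "month": 2,  # March (0-indexed: 0=Jan, 11=Dec)
    "location": "NY",
    "cintensity_file": "NYIS_NG_&_avgCI.csv",
    "weather_file": "USA_NY_New.York-Kennedy.Intl.AP.744860_TMY3.epw",
    "workload_file": "Alibaba_CPU_Data_Hourly_1.csv",
    "datacenter_capacity_mw": 1.0,
    "battery_capacity_mwh": 0.5,
    "flexible_load": 0.1,
    "individual_reward_weight": 0.8,
    "ls_reward": "default_ls_reward",
    "dc_reward": "default_dc_reward",
    "bat_reward": "default_bat_reward",
})

env = SustainDC(env_config)
obs = env.reset()  # returns only the states dict (no info tuple)

print("=== Observation Spaces ===")
for i, space in enumerate(env.observation_space):
    print(f"  agent {i}: {space.shape}")

print("\n=== Action Spaces ===")
for i, space in enumerate(env.action_space):
    print(f"  agent {i}: {space}")

print("\n=== Initial Observations (first 5 values) ===")
for agent_id, ob in obs.items():
    print(f"  {agent_id}: {ob[:5]}...")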
You will see messages like "Warning: Please check if the do nothing action for Battery is the '2' action" during init, and "Warning, using base agent for agent_bat: 2" on every step. These are harmless print() statements from the SustainDC code confirming that baseline agents are active for agents not in your "agents" list. They cannot be suppressed without editing the source.
month Parameter is Required
The month key is not in EnvConfig’s defaults, but it must be provided — otherwise self.month is None and the environment crashes when creating sub-environments. The value is 0-indexed: 0 = January, 11 = December. During training, the HARL framework sets month automatically via the worker index (cycling through all 12 months across parallel workers). For manual experimentation, you must set it yourself.
You should see three agents, each with their own observation and action spaces. The observation dimensions differ because each agent sees different state variables.
SustainDC was designed for the HARL multi-agent training framework, not as a standard Gymnasium environment. A few things to watch for:

- reset() returns only the observation dict — not the (obs, info) tuple you would expect from Gymnasium. The info dict is stored internally on self.infos.
- observation_space and action_space are lists (one entry per active agent, in order), not dicts keyed by agent name. The ordering follows the order agents appear in the "agents" config list.
- step() does return the standard 5-tuple (obs, rewards, terminated, truncated, info), all as dicts keyed by agent name.
- Episode end uses truncation, not termination. _handle_terminal() sets truncateds["__all__"] = True but leaves terminateds["__all__"] = False. Your loop must check both: done = terminated.get("__all__", False) or truncated.get("__all__", False). If you only check terminated, the loop will never exit and the simulation will eventually crash when the data managers run past the end of their arrays.
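A minimal loop skeleton that respects these quirks, using the env created above (the same pattern appears in the full example in Section 16.5):

obs = env.reset()   # observation dict only, no (obs, info) tuple
done = False
while not done:
    # Do-nothing action for each agent (see the table in the next subsection)
    actions = {"agent_ls": 1, "agent_dc": 1, "agent_bat": 2}
    obs, rewards, terminated, truncated, info = env.step(actions)
    # Episode end is signalled through truncation, so check both flags
    done = terminated.get("__all__", False) or truncated.get("__all__", False)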
16.4.1 Understanding the Action Spaces
All three agents use Discrete(3) action spaces, but the actions mean different things:
| Agent | Action 0 | Action 1 | Action 2 | Do-Nothing |
|---|---|---|---|---|
| \(Agent_{LS}\) | Defer shiftable tasks | Do nothing (process normally) | Process deferred task queue | 1 |
| \(Agent_{DC}\) | Decrease CRAH setpoint | Maintain setpoint | Increase setpoint | 1 |
| \(Agent_{BAT}\) | Charge battery | Discharge battery | Idle (no charge/discharge) | 2 |
Note that the “do nothing” action is 1 for the workload and cooling agents but 2 for the battery agent. This is easy to get wrong when writing baseline policies — always use the values from utils/base_agents.py as the reference.
16.5 Running a Multi-Agent Simulation
16.5.1 Full Three-Agent Loop
Here is a complete simulation with all three agents taking fixed baseline actions:
from sustaindc_env import SustainDC, EnvConfig
import numpy as np
env_config = EnvConfig({
"agents": ["agent_ls", "agent_dc", "agent_bat"],
"month": 2,
"location": "NY",
"cintensity_file": "NYIS_NG_&_avgCI.csv",
"weather_file": "USA_NY_New.York-Kennedy.Intl.AP.744860_TMY3.epw",
"workload_file": "Alibaba_CPU_Data_Hourly_1.csv",
"datacenter_capacity_mw": 1.0,
"battery_capacity_mwh": 0.5,
"flexible_load": 0.1,
"individual_reward_weight": 0.8,
"ls_reward": "default_ls_reward",
"dc_reward": "default_dc_reward",
"bat_reward": "default_bat_reward",
})
env = SustainDC(env_config)
obs = env.reset()
# Fixed baseline actions (do-nothing for each agent)
actions = {
"agent_ls": 1, # process normally (no load shifting)
"agent_dc": 1, # maintain current setpoint
"agent_bat": 2, # idle (no charge/discharge) — note: NOT 1
}
done = False
step_count = 0
total_rewards = {"agent_ls": 0.0, "agent_dc": 0.0, "agent_bat": 0.0}
# Track key metrics over time
carbon_trace = []
hvac_energy_trace = []
it_energy_trace = []
while not done:
    obs, rewards, terminated, truncated, info = env.step(actions)

    for agent_id in total_rewards:
        total_rewards[agent_id] += rewards[agent_id]

    # Extract metrics from the DC agent's info dict
    dc_info = info["agent_dc"]
    carbon_trace.append(dc_info.get("bat_CO2_footprint", 0))
    hvac_energy_trace.append(dc_info.get("dc_HVAC_total_power_kW", 0))
    it_energy_trace.append(dc_info.get("dc_ITE_total_power_kW", 0))

    step_count += 1
    done = terminated.get("__all__", False) or truncated.get("__all__", False)
print(f"Episode finished after {step_count} steps")
print(f"\nTotal rewards:")
for agent_id, r in total_rewards.items():
print(f" {agent_id}: {r:.2f}")
print(f"\nFinal carbon footprint: {carbon_trace[-1]:.2f} gCO2")
print(f"Avg HVAC power: {np.mean(hvac_energy_trace):.2f} kW")
print(f"Avg IT power: {np.mean(it_energy_trace):.2f} kW")16.5.2 Single-Agent Mode (Cooling Only)
For simpler experiments, you can activate only the cooling agent. The other two agents will automatically use built-in baseline controllers:
env_config_single = EnvConfig({
"agents": ["agent_dc"], # only cooling agent is learning
"month": 2,
"location": "NY",
"cintensity_file": "NYIS_NG_&_avgCI.csv",
"weather_file": "USA_NY_New.York-Kennedy.Intl.AP.744860_TMY3.epw",
"workload_file": "Alibaba_CPU_Data_Hourly_1.csv",
"datacenter_capacity_mw": 1.0,
"battery_capacity_mwh": 0.5,
"flexible_load": 0.1,
})
env_single = SustainDC(env_config_single)
obs = env_single.reset()
# Now only agent_dc needs an action
done = False
while not done:
    obs, rew, terminated, truncated, info = env_single.step({"agent_dc": 1})
    done = terminated.get("__all__", False) or truncated.get("__all__", False)

When an agent is not in the "agents" list, SustainDC uses a do-nothing baseline from utils/base_agents.py. Specifically: the workload agent takes action 1 (process normally, no load shifting), the cooling agent takes action 1 (maintain setpoint), and the battery agent takes action 2 (idle). You will see warnings like "Warning, using base agent for agent_bat: 2" — these are informational and confirm the base agent is active.
16.6 Using Rule-Based Controllers
Before jumping to RL, it is useful to understand the rule-based battery controller in utils/rbc_agents.py. The repository currently ships one RBC agent — for the battery — while the cooling and workload baselines are the do-nothing agents from utils/base_agents.py.
16.6.1 Carbon-Aware Battery Controller
RBCBatteryAgent implements a simple heuristic: it compares the current carbon intensity against a smoothed look-ahead forecast. If the forecast CI is higher than current (grid is getting dirtier), it charges now while the grid is cleaner; otherwise it discharges stored energy:
from utils.rbc_agents import RBCBatteryAgent
# look_ahead: how many future CI steps to consider
# smooth_window: moving average window for the forecast
rbc_battery = RBCBatteryAgent(look_ahead=3, smooth_window=1)
# In your simulation loop, pass the CI forecast and current battery SoC:
# action = rbc_battery.act(ci_forecast_values, current_soc)
# Returns: 0 (charge), 1 (discharge), or 2 (idle)

The paper's benchmarks (Section 6) reference an ASHRAE Guideline 36 cooling baseline, but this is implemented separately in ashrae36_evalpydcm.py, not as a reusable agent class. For the cooling agent, the do-nothing baseline (action 1 = maintain setpoint) is what base_agents.py provides. Writing a proportional or rule-based cooling controller is a good exercise — see the examples in the PyDCM tutorial.
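To make the battery heuristic concrete before moving on to training, here is a standalone sketch of the same charge/discharge logic in plain Python and NumPy. It does not use the SustainDC classes, and the SoC bounds are illustrative assumptions rather than values taken from rbc_agents.py; the action encoding (0 = charge, 1 = discharge, 2 = idle) matches the table in Section 16.4.1.

import numpy as np

def carbon_aware_battery_action(ci_now, ci_forecast, soc,
                                look_ahead=3, smooth_window=1,
                                soc_min=0.1, soc_max=0.9):
    """Return 0 (charge), 1 (discharge), or 2 (idle)."""
    # Smooth the next `look_ahead` carbon-intensity values with a moving average
    horizon = np.asarray(ci_forecast[:look_ahead], dtype=float)
    if smooth_window > 1:
        horizon = np.convolve(horizon, np.ones(smooth_window) / smooth_window,
                              mode="valid")
    ci_future = horizon.mean()

    if ci_future > ci_now and soc < soc_max:
        return 0   # grid is about to get dirtier: charge while it is cleaner
    if ci_future < ci_now and soc > soc_min:
        return 1   # grid is about to get cleaner: discharge stored energy now
    return 2       # otherwise hold the current state of charge

# CI is forecast to rise and the battery has headroom, so the action is 0 (charge)
print(carbon_aware_battery_action(ci_now=300.0,
                                  ci_forecast=[320.0, 350.0, 360.0],
                                  soc=0.4))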
16.7 Training with the HARL Framework
SustainDC includes the Heterogeneous Agent RL (HARL) framework, which supports a wide range of multi-agent algorithms. Training is done through train_sustaindc.py.
16.7.1 Quick Start: HAPPO
HAPPO (Heterogeneous Agent PPO) is the recommended starting algorithm. It handles agents with different observation and action spaces natively:
uv run python train_sustaindc.py --algo happo --exp_name happo_ny

16.7.2 Other Supported Algorithms
| Algorithm | Type | Key Idea |
|---|---|---|
| IPPO | On-policy, independent | Each agent trains its own PPO; no shared information |
| MAPPO | On-policy, centralized critic | Shared critic sees all agents’ observations |
| HAPPO | On-policy, heterogeneous | Like MAPPO but handles different obs/action spaces |
| HAA2C | On-policy, heterogeneous | Advantage Actor-Critic variant of HAPPO |
| HAD3QN | Off-policy, heterogeneous | Dueling Double DQN for discrete actions |
| HASAC | Off-policy, heterogeneous | Soft Actor-Critic with entropy regularization |
| MADDPG | Off-policy, centralized critic | Multi-Agent DDPG (continuous actions) |
To try a different algorithm:
uv run python train_sustaindc.py --algo mappo --exp_name mappo_ny
uv run python train_sustaindc.py --algo ippo --exp_name ippo_ny

16.7.3 Configuration Files
The HARL framework loads configuration from YAML files in harl/configs/:
- Algorithm config: harl/configs/algos_cfgs/<algo>.yaml — learning rate, batch size, number of epochs, etc.
- Environment config: harl/configs/envs_cfgs/sustaindc.yaml — location, data files, reward settings, agent list
You can override any config parameter from the command line:
# Train in Arizona with a different reward weight
uv run python train_sustaindc.py --algo happo --exp_name happo_az \
--location AZ \
--cintensity_file "AZPS_NG_&_avgCI.csv" \
--weather_file USA_AZ_Davis-Monthan.AFB.722745_TMY3.epw \
--individual_reward_weight 0.5

16.7.4 Monitoring Training
Training logs are written to TensorBoard format:
uv run tensorboard --logdir results/sustaindc/

Key metrics to monitor:
- Episode reward (per agent): should increase over training
- Carbon footprint: should decrease
- Task queue length: for the LS agent, should stay low (tasks being completed on time)
16.8 Evaluation
16.8.1 The Five SustainDC Metrics
The paper defines five evaluation metrics (Section 5):
| Metric | Description | Lower is Better? |
|---|---|---|
| \(CO_2\) Footprint (\(CFP\)) | Cumulative carbon emissions: \(\sum_t (E_{hvac} + E_{it} + E_{bat}) \times CI_t\) | Yes |
| HVAC Energy | Total cooling energy (chiller + pumps + cooling tower + CRAH fan) | Yes |
| IT Energy | Total server energy consumption | Yes |
| Water Usage | Cooling tower water consumption | Yes |
| Task Queue | Accumulated deferred workload not completed within the horizon | Yes |
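The CFP metric is simply the grid energy drawn at each step multiplied by that step's carbon intensity, summed over the episode. A toy calculation with made-up numbers (energy in kWh, CI in gCO2/kWh):

import numpy as np

# Four timesteps of made-up per-step energy (kWh) and grid carbon intensity (gCO2/kWh)
e_hvac = np.array([120.0, 150.0, 180.0, 140.0])   # cooling energy
e_it   = np.array([400.0, 420.0, 450.0, 410.0])   # IT energy
e_bat  = np.array([ 50.0,   0.0, -60.0,   0.0])   # battery charging (+) / discharging (-)
ci     = np.array([300.0, 350.0, 420.0, 380.0])   # carbon intensity

cfp_g = np.sum((e_hvac + e_it + e_bat) * ci)      # cumulative footprint in gCO2
print(f"CO2 footprint: {cfp_g / 1e6:.2f} tCO2")   # 1 tCO2 = 1e6 gCO2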
16.8.2 Running Evaluation
Use the eval_sustaindc.py script to evaluate a trained checkpoint:
uv run python eval_sustaindc.py --algo happo --exp_name happo_ny

Or evaluate programmatically within a notebook:
from sustaindc_env import SustainDC, EnvConfig
import numpy as np
def evaluate_episode(env, policy_fn):
    """Run one full episode and collect metrics."""
    obs = env.reset()
    done = False
    metrics = {
        "carbon": [], "hvac_energy": [], "it_energy": [],
        "water": [], "rewards": {"agent_ls": 0, "agent_dc": 0, "agent_bat": 0},
    }
    while not done:
        actions = policy_fn(obs)
        obs, rewards, terminated, truncated, info = env.step(actions)

        dc_info = info["agent_dc"]
        metrics["carbon"].append(dc_info.get("bat_CO2_footprint", 0))
        metrics["hvac_energy"].append(dc_info.get("dc_HVAC_total_power_kW", 0))
        metrics["it_energy"].append(dc_info.get("dc_ITE_total_power_kW", 0))
        metrics["water"].append(dc_info.get("dc_water_usage", 0))

        for agent_id in metrics["rewards"]:
            if agent_id in rewards:
                metrics["rewards"][agent_id] += rewards[agent_id]

        done = terminated.get("__all__", False) or truncated.get("__all__", False)

    return {
        "total_carbon": metrics["carbon"][-1] if metrics["carbon"] else 0,
        "avg_hvac_kw": np.mean(metrics["hvac_energy"]),
        "avg_it_kw": np.mean(metrics["it_energy"]),
        "total_water": np.sum(metrics["water"]),
        "rewards": metrics["rewards"],
    }
# --- Baseline policy: do-nothing for each agent ---
DO_NOTHING = {"agent_ls": 1, "agent_dc": 1, "agent_bat": 2}
def baseline_policy(obs):
    return {agent: DO_NOTHING[agent] for agent in obs}
# --- Create environment ---
env_config = EnvConfig({
"agents": ["agent_ls", "agent_dc", "agent_bat"],
"month": 2,
"location": "NY",
"cintensity_file": "NYIS_NG_&_avgCI.csv",
"weather_file": "USA_NY_New.York-Kennedy.Intl.AP.744860_TMY3.epw",
"workload_file": "Alibaba_CPU_Data_Hourly_1.csv",
"datacenter_capacity_mw": 1.0,
"battery_capacity_mwh": 0.5,
"flexible_load": 0.1,
"individual_reward_weight": 0.8,
"ls_reward": "default_ls_reward",
"dc_reward": "default_dc_reward",
"bat_reward": "default_bat_reward",
})
env = SustainDC(env_config)
baseline_results = evaluate_episode(env, baseline_policy)
print("=== Baseline Results ===")
print(f" Carbon footprint: {baseline_results['total_carbon']:.2f} gCO2")
print(f" Avg HVAC power: {baseline_results['avg_hvac_kw']:.2f} kW")
print(f" Avg IT power: {baseline_results['avg_it_kw']:.2f} kW")
print(f" Total water: {baseline_results['total_water']:.2f} L")16.9 Reward Customization
One of SustainDC’s strengths is its flexible reward system. All reward functions are defined in utils/reward_creator.py.
16.9.1 Available Reward Functions
| Reward Function | Target Metric |
|---|---|
| default_dc_reward | \(-(E_{hvac} + E_{it}) \times CI_t\) (negative carbon footprint) |
| default_ls_reward | \(-(CFP_t + LS_{penalty})\) (carbon + task completion penalty) |
| default_bat_reward | \(-CFP_t\) (total carbon including battery) |
| tou_reward | Time-of-use electricity cost |
| energy_PUE_reward | Power Usage Effectiveness |
| water_usage_efficiency_reward | Cooling water consumption |
| temperature_efficiency_reward | Thermal constraint satisfaction |
16.9.2 Changing Reward Functions
To train with a different objective, simply change the reward strings in the config:
env_config_custom = EnvConfig({
"agents": ["agent_ls", "agent_dc", "agent_bat"],
"month": 6, # July — try a summer month for higher cooling loads
"location": "NY",
"cintensity_file": "NYIS_NG_&_avgCI.csv",
"weather_file": "USA_NY_New.York-Kennedy.Intl.AP.744860_TMY3.epw",
"workload_file": "Alibaba_CPU_Data_Hourly_1.csv",
"datacenter_capacity_mw": 1.0,
"battery_capacity_mwh": 0.5,
"flexible_load": 0.1,
# Use water usage as the DC reward instead of carbon
"dc_reward": "water_usage_efficiency_reward",
# Keep defaults for the other agents
"ls_reward": "default_ls_reward",
"bat_reward": "default_bat_reward",
})

16.9.3 Collaborative Reward Sharing (\(\alpha\))
The individual_reward_weight parameter controls \(\alpha\), the balance between an agent’s own reward and the rewards from other agents:
\[R_{DC} = \frac{(1-\alpha)}{2} \cdot r_{LS} + \alpha \cdot r_{DC} + \frac{(1-\alpha)}{2} \cdot r_{BAT}\]
- \(\alpha = 1.0\): fully independent — each agent only sees its own reward
- \(\alpha = 0.8\): default — 80% own reward, 10% from each other agent
- \(\alpha = 0.1\): highly collaborative — each agent mostly optimizes for the group
The paper (Section 6.2) shows that \(\alpha = 0.8\) outperforms both extremes, especially in partially observable settings where agents benefit from indirect feedback about how their actions affect other subsystems.
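A quick numeric check of the formula with made-up per-agent rewards:

def combined_reward(r_own, r_other_a, r_other_b, alpha=0.8):
    # alpha weight on the agent's own reward, (1 - alpha)/2 on each of the others
    return alpha * r_own + (1 - alpha) / 2 * (r_other_a + r_other_b)

r_ls, r_dc, r_bat = -1.0, -2.0, -0.5                  # illustrative rewards
print(combined_reward(r_dc, r_ls, r_bat, alpha=0.8))  # R_DC = -1.75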
# Ablation: try different alpha values
uv run python train_sustaindc.py --algo ippo --exp_name ippo_alpha10 --individual_reward_weight 1.0
uv run python train_sustaindc.py --algo ippo --exp_name ippo_alpha08 --individual_reward_weight 0.8
uv run python train_sustaindc.py --algo ippo --exp_name ippo_alpha01 --individual_reward_weight 0.1

16.10 Multi-Location Experiments
SustainDC ships with data for 8 US locations, chosen for their high data center density and diverse climates/grid mixes:
| Location | Climate | Grid Characteristics |
|---|---|---|
| AZ (Arizona) | Hot, arid | High solar penetration |
| CA (California) | Mediterranean | Mixed renewables, moderate CI |
| GA (Georgia) | Hot, humid | Coal + nuclear, higher CI |
| IL (Illinois) | Continental | Nuclear-heavy, lower CI |
| NY (New York) | Continental | Gas + nuclear, moderate CI |
| TX (Texas) | Hot, variable | Wind + gas, volatile CI |
| VA (Virginia) | Humid subtropical | Highest DC density in the US |
| WA (Washington) | Marine | Hydro-heavy, very low CI |
The data files are located in:
- data/Weather/ — EPW files (Typical Meteorological Year format)
- data/CarbonIntensity/ — hourly grid carbon intensity from the US EIA
Train the same HAPPO agent in Arizona (hot, high cooling load) and Washington (mild, low CI). Compare:
- Does the Arizona agent learn more aggressive cooling strategies?
- Does the Washington agent rely less on the battery (since the grid is already clean)?
- How do carbon footprints compare across locations?
uv run python train_sustaindc.py --algo happo --exp_name happo_az \
--location AZ \
--cintensity_file "AZPS_NG_&_avgCI.csv" \
--weather_file USA_AZ_Davis-Monthan.AFB.722745_TMY3.epw
uv run python train_sustaindc.py --algo happo --exp_name happo_wa \
--location WA \
--cintensity_file "WAAT_NG_&_avgCI.csv" \
--weather_file USA_WA_Port.Angeles-Fairchild.Intl.AP.727885_TMY3.epw

16.11 From PyDCM to SustainDC: What Changed?
If you completed the PyDCM tutorial, here is a summary of what SustainDC adds:
| Aspect | PyDCM (pydcm branch) | SustainDC (main branch) |
|---|---|---|
| Agents | Single (cooling only) | Three (workload + cooling + battery) |
| Training | Ray RLlib PPO | HARL framework (10+ algorithms) |
| Baselines | Fixed action | Rule-based controllers (ASHRAE 36, etc.) |
| Reward | Single carbon footprint | Per-agent rewards with collaborative sharing |
| Locations | 3 (NY, AZ, WA) | 8 US locations |
| Workload data | Alibaba only | Alibaba + Google traces |
| Observation space | 5–12 dimensions | 26+ dimensions (includes CI trends, forecasts) |
| Evaluation | Energy + carbon | 5 metrics (CFP, HVAC, IT, water, task queue) |
| Config | DCRL wrapper | SustainDC + EnvConfig with full YAML support |
The underlying thermal and HVAC models (the vectorized NumPy code we examined) are the same. The difference is in the orchestration layer and the breadth of the experimental framework.
16.12 Next Steps
- Assignment 3: Apply the Gnu-RL differentiable MPC policy (Paper 3) to the SustainDC DC cooling environment. Start in single-agent mode (agents: ["agent_dc"]) before attempting multi-agent.
- Algorithm comparison: Reproduce the paper's radar charts (Figure 5) by training IPPO, MAPPO, and HAPPO on the same location and comparing the five metrics.
- Custom rewards: Define a reward function in utils/reward_creator.py that balances carbon footprint and water usage, and observe how the learned policy changes. A starting-point sketch follows this list.
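As a starting point for the custom-rewards exercise, here is a sketch of what such a function could look like. It mirrors the params-dict style of the default rewards, but the specific dictionary keys and the 0.7/0.3 weights are assumptions made for illustration; check the existing functions in utils/reward_creator.py for the real parameter names before wiring it in.

# Hypothetical custom reward balancing carbon footprint and water usage.
# The params keys below are assumptions; match them to the names actually
# used by the reward functions in utils/reward_creator.py.
def carbon_water_reward(params: dict) -> float:
    carbon = params["total_energy_kwh"] * params["carbon_intensity"]  # assumed keys
    water = params["water_usage"]                                     # assumed key
    w_carbon, w_water = 0.7, 0.3                                      # illustrative weights
    # Less carbon and less water -> larger (less negative) reward
    return -(w_carbon * carbon + w_water * water)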