16 Hands-On: Getting Started with SustainDC
16.1 Overview
The code in this notebook requires the dc-rl repository (main branch) and its dependencies. It is not executed during the book build. Follow along by running each cell in your own Python environment.
16.2 Environment Setup
16.2.1 Cloning the Repository
SustainDC lives on the main branch of the same repository as PyDCM:
git clone https://github.com/HewlettPackard/dc-rl.git sustaindc
cd sustaindc

The pydcm branch (BuildSys 2023) has a simpler single-agent wrapper. The main branch (NeurIPS 2024) adds the multi-agent orchestration layer (sustaindc_env.py), the HARL training framework, rule-based baseline agents, and expanded data for 8 US locations.
16.2.2 Installing Dependencies
Initialize a uv project in the cloned repository and install the core simulation dependencies:
uv init --python 3.10
uv add numpy==1.23.5 pandas==1.5.3 gymnasium==0.29.1 scipy==1.13.0 PyYAML==6.0.1 PsychroLib==2.5.0 "setuptools<72" matplotlib dash dash_bootstrap_components

This gives you everything needed to run simulations. For training with the HARL framework, also add:
uv add torch==2.0.0 tensorboard==2.12.3 tensorboardX==2.6.2.2 setproctitle "supersuit==3.7.0" "pettingzoo==1.22.2"

Why not uv add -r requirements.txt?
The full requirements.txt pulls in PyTorch, TensorFlow, Ray, JAX, and many other heavy libraries — some of which may conflict or fail to install on your platform. Installing the core dependencies explicitly gives you a working simulation environment without the headaches. Add training dependencies only when you need them.
16.3 The SustainDC Architecture
Before writing any code, it helps to understand the key architectural difference from PyDCM: SustainDC orchestrates three interconnected Gymnasium environments through a single wrapper class.
16.3.1 Sequential Execution Within Each Timestep
At every timestep, the three agents act sequentially, with information flowing downstream:
1. Agent_LS (Workload Scheduler)
Observes: time of day, carbon intensity forecast, pending workload
Acts: store delayable tasks / compute all / maximize throughput
↓ adjusted workload (\(\hat{B}_t\))
2. Agent_DC (Cooling Optimizer)
Observes: ambient temp, room temp, HVAC power, IT power, CI forecast
Acts: decrease / maintain / increase CRAH setpoint
↓ total energy consumption (\(E_{hvac} + E_{it}\))
3. Agent_BAT (Battery Manager)
Observes: battery SoC, DC energy consumption, CI forecast
Acts: charge / hold / discharge
This ordering matters: the workload agent’s decision changes how much heat the servers generate, which affects the cooling agent’s problem, which in turn determines the total energy that the battery agent must manage.
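To make the data flow concrete, here is a toy, fully self-contained sketch of one timestep. Every constant and formula in it is invented purely for illustration; the real thermal, workload, and battery models live in the three sub-environments.

# Toy illustration of the sequential dependency between the three agents.
# All constants and formulas below are made up for the example; the real
# physics lives in the SustainDC sub-environments.
def toy_step(actions, workload, outside_temp_c, carbon_intensity):
    # 1. Workload agent: defer some load, process normally, or work off the queue
    load_factor = {0: 0.9, 1: 1.0, 2: 1.1}[actions["agent_ls"]]
    adjusted_workload = workload * load_factor

    # 2. Cooling agent: IT heat follows the adjusted workload; HVAC power
    #    follows the IT heat, the weather, and the setpoint move
    it_power_kw = 200.0 + 800.0 * adjusted_workload
    setpoint_delta = {0: -1.0, 1: 0.0, 2: 1.0}[actions["agent_dc"]]
    hvac_power_kw = 0.3 * it_power_kw * (1 + 0.02 * (outside_temp_c - 20.0)) - 10.0 * setpoint_delta

    # 3. Battery agent: charging adds to grid demand, discharging offsets it
    battery_kw = {0: 50.0, 1: -50.0, 2: 0.0}[actions["agent_bat"]]
    grid_kw = it_power_kw + hvac_power_kw + battery_kw

    # Carbon for this step = grid power draw x carbon intensity
    return grid_kw * carbon_intensity

do_nothing = {"agent_ls": 1, "agent_dc": 1, "agent_bat": 2}
print(toy_step(do_nothing, workload=0.6, outside_temp_c=25.0, carbon_intensity=350.0))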
16.3.2 The SustainDC Wrapper
The SustainDC class in sustaindc_env.py handles all of this coordination. It:
- Creates the three sub-environments via factory functions
- Manages four data managers (Time_Manager, Workload_Manager, Weather_Manager, CI_Manager)
- Routes information between sub-environments at each step
- Constructs per-agent observation vectors (including CI trend features like slopes and percentiles)
- Computes collaborative rewards based on the \(\alpha\) weighting parameter
16.4 Verifying the Setup
Let’s confirm everything works by creating the environment and inspecting its structure:
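from sustaindc_env import SustainDC, EnvConfig

# Default config: all three agents active, New York location
env_config = EnvConfig({
    "agents": ["agent_ls", "agent_dc", "agent_bat"],
    "month": 2,  # March (0-indexed: 0=Jan, 11=Dec)
    "location": "NY",
    "cintensity_file": "NYIS_NG_&_avgCI.csv",
    "weather_file": "USA_NY_New.York-Kennedy.Intl.AP.744860_TMY3.epw",
    "workload_file": "Alibaba_CPU_Data_Hourly_1.csv",
    "datacenter_capacity_mw": 1.0,
    "battery_capacity_mwh": 0.5,
    "flexible_load": 0.1,
    "individual_reward_weight": 0.8,
    "ls_reward": "default_ls_reward",
    "dc_reward": "default_dc_reward",
    "bat_reward": "default_bat_reward",
})

env = SustainDC(env_config)
obs = env.reset()  # returns only the states dict (no info tuple)

print("=== Observation Spaces ===")
for i, space in enumerate(env.observation_space):
    print(f"  agent {i}: {space.shape}")

print("\n=== Action Spaces ===")
for i, space in enumerate(env.action_space):
    print(f"  agent {i}: {space}")

print("\n=== Initial Observations (first 5 values) ===")
for agent_id, ob in obs.items():
    print(f"  {agent_id}: {ob[:5]}...")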
You will see messages like "Warning: Please check if the do nothing action for Battery is the '2' action" during init, and "Warning, using base agent for agent_bat: 2" on every step. These are harmless print() statements from the SustainDC code confirming that baseline agents are active for agents not in your "agents" list. They cannot be suppressed without editing the source.
month Parameter is Required
The month key is not in EnvConfig’s defaults, but it must be provided — otherwise self.month is None and the environment crashes when creating sub-environments. The value is 0-indexed: 0 = January, 11 = December. During training, the HARL framework sets month automatically via the worker index (cycling through all 12 months across parallel workers). For manual experimentation, you must set it yourself.
You should see three agents, each with their own observation and action spaces. The observation dimensions differ because each agent sees different state variables.
SustainDC was designed for the HARL multi-agent training framework, not as a standard Gymnasium environment. A few things to watch for:

- reset() returns only the observation dict — not the (obs, info) tuple you would expect from Gymnasium. The info dict is stored internally on self.infos.
- observation_space and action_space are lists (one entry per active agent, in order), not dicts keyed by agent name. The ordering follows the order agents appear in the "agents" config list.
- step() does return the standard 5-tuple (obs, rewards, terminated, truncated, info), all as dicts keyed by agent name.
- Episode end uses truncation, not termination. _handle_terminal() sets truncateds["__all__"] = True but leaves terminateds["__all__"] = False. Your loop must check both: done = terminated.get("__all__", False) or truncated.get("__all__", False). If you only check terminated, the loop will never exit and the simulation will eventually crash when the data managers run past the end of their arrays.
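A minimal loop skeleton that respects these quirks, using the env created above (the same pattern appears in the full example in Section 16.5):

obs = env.reset()   # observation dict only, no (obs, info) tuple
done = False
while not done:
    # Do-nothing action for each agent (see the table in the next subsection)
    actions = {"agent_ls": 1, "agent_dc": 1, "agent_bat": 2}
    obs, rewards, terminated, truncated, info = env.step(actions)
    # Episode end is signalled through truncation, so check both flags
    done = terminated.get("__all__", False) or truncated.get("__all__", False)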
16.4.1 Understanding the Action Spaces
All three agents use Discrete(3) action spaces, but the actions mean different things:
| Agent | Action 0 | Action 1 | Action 2 | Do-Nothing |
|---|---|---|---|---|
| \(Agent_{LS}\) | Defer shiftable tasks | Do nothing (process normally) | Process deferred task queue | 1 |
| \(Agent_{DC}\) | Decrease CRAH setpoint | Maintain setpoint | Increase setpoint | 1 |
| \(Agent_{BAT}\) | Charge battery | Discharge battery | Idle (no charge/discharge) | 2 |
Note that the “do nothing” action is 1 for the workload and cooling agents but 2 for the battery agent. This is easy to get wrong when writing baseline policies — always use the values from utils/base_agents.py as the reference.
16.5 Running a Multi-Agent Simulation
16.5.1 Full Three-Agent Loop
Here is a complete simulation with all three agents taking fixed baseline actions:
from sustaindc_env import SustainDC, EnvConfig
import numpy as np
env_config = EnvConfig({
"agents": ["agent_ls", "agent_dc", "agent_bat"],
"month": 2,
"location": "NY",
"cintensity_file": "NYIS_NG_&_avgCI.csv",
"weather_file": "USA_NY_New.York-Kennedy.Intl.AP.744860_TMY3.epw",
"workload_file": "Alibaba_CPU_Data_Hourly_1.csv",
"datacenter_capacity_mw": 1.0,
"battery_capacity_mwh": 0.5,
"flexible_load": 0.1,
"individual_reward_weight": 0.8,
"ls_reward": "default_ls_reward",
"dc_reward": "default_dc_reward",
"bat_reward": "default_bat_reward",
})
env = SustainDC(env_config)
obs = env.reset()
# Fixed baseline actions (do-nothing for each agent)
actions = {
"agent_ls": 1, # process normally (no load shifting)
"agent_dc": 1, # maintain current setpoint
"agent_bat": 2, # idle (no charge/discharge) — note: NOT 1
}
done = False
step_count = 0
total_rewards = {"agent_ls": 0.0, "agent_dc": 0.0, "agent_bat": 0.0}
# Track key metrics over time
carbon_trace = []
hvac_energy_trace = []
it_energy_trace = []
while not done:
    obs, rewards, terminated, truncated, info = env.step(actions)

    for agent_id in total_rewards:
        total_rewards[agent_id] += rewards[agent_id]

    # Extract metrics from the DC agent's info dict
    dc_info = info["agent_dc"]
    carbon_trace.append(dc_info.get("bat_CO2_footprint", 0))
    hvac_energy_trace.append(dc_info.get("dc_HVAC_total_power_kW", 0))
    it_energy_trace.append(dc_info.get("dc_ITE_total_power_kW", 0))

    step_count += 1
    done = terminated.get("__all__", False) or truncated.get("__all__", False)
print(f"Episode finished after {step_count} steps")
print(f"\nTotal rewards:")
for agent_id, r in total_rewards.items():
print(f" {agent_id}: {r:.2f}")
print(f"\nFinal carbon footprint: {carbon_trace[-1]:.2f} gCO2")
print(f"Avg HVAC power: {np.mean(hvac_energy_trace):.2f} kW")
print(f"Avg IT power: {np.mean(it_energy_trace):.2f} kW")16.5.2 Single-Agent Mode (Cooling Only)
For simpler experiments, you can activate only the cooling agent. The other two agents will automatically use built-in baseline controllers:
env_config_single = EnvConfig({
"agents": ["agent_dc"], # only cooling agent is learning
"month": 2,
"location": "NY",
"cintensity_file": "NYIS_NG_&_avgCI.csv",
"weather_file": "USA_NY_New.York-Kennedy.Intl.AP.744860_TMY3.epw",
"workload_file": "Alibaba_CPU_Data_Hourly_1.csv",
"datacenter_capacity_mw": 1.0,
"battery_capacity_mwh": 0.5,
"flexible_load": 0.1,
})
env_single = SustainDC(env_config_single)
obs = env_single.reset()
# Now only agent_dc needs an action
done = False
while not done:
    obs, rew, terminated, truncated, info = env_single.step({"agent_dc": 1})
    done = terminated.get("__all__", False) or truncated.get("__all__", False)

When an agent is not in the "agents" list, SustainDC uses a do-nothing baseline from utils/base_agents.py. Specifically: the workload agent takes action 1 (process normally, no load shifting), the cooling agent takes action 1 (maintain setpoint), and the battery agent takes action 2 (idle). You will see warnings like "Warning, using base agent for agent_bat: 2" — these are informational and confirm the base agent is active.
16.6 Using Rule-Based Controllers
Before jumping to RL, it is useful to understand the rule-based battery controller in utils/rbc_agents.py. The repository currently ships one RBC agent — for the battery — while the cooling and workload baselines are the do-nothing agents from utils/base_agents.py.
16.6.1 Carbon-Aware Battery Controller
RBCBatteryAgent implements a simple heuristic: it compares the current carbon intensity against a smoothed look-ahead forecast. If the forecast CI is higher than current (grid is getting dirtier), it charges now while the grid is cleaner; otherwise it discharges stored energy:
from utils.rbc_agents import RBCBatteryAgent
# look_ahead: how many future CI steps to consider
# smooth_window: moving average window for the forecast
rbc_battery = RBCBatteryAgent(look_ahead=3, smooth_window=1)
# In your simulation loop, pass the CI forecast and current battery SoC:
# action = rbc_battery.act(ci_forecast_values, current_soc)
# Returns: 0 (charge), 1 (discharge), or 2 (idle)

The paper's benchmarks (Section 6) reference an ASHRAE Guideline 36 cooling baseline, but this is implemented separately in ashrae36_evalpydcm.py, not as a reusable agent class. For the cooling agent, the do-nothing baseline (action 1 = maintain setpoint) is what base_agents.py provides. Writing a proportional or rule-based cooling controller is a good exercise — see the examples in the PyDCM tutorial.
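To make the battery heuristic concrete before moving on to training, here is a standalone sketch of the same charge/discharge logic in plain Python and NumPy. It does not use the SustainDC classes, and the SoC bounds are illustrative assumptions rather than values taken from rbc_agents.py; the action encoding (0 = charge, 1 = discharge, 2 = idle) matches the table in Section 16.4.1.

import numpy as np

def carbon_aware_battery_action(ci_now, ci_forecast, soc,
                                look_ahead=3, smooth_window=1,
                                soc_min=0.1, soc_max=0.9):
    """Return 0 (charge), 1 (discharge), or 2 (idle)."""
    # Smooth the next `look_ahead` carbon-intensity values with a moving average
    horizon = np.asarray(ci_forecast[:look_ahead], dtype=float)
    if smooth_window > 1:
        horizon = np.convolve(horizon, np.ones(smooth_window) / smooth_window,
                              mode="valid")
    ci_future = horizon.mean()

    if ci_future > ci_now and soc < soc_max:
        return 0   # grid is about to get dirtier: charge while it is cleaner
    if ci_future < ci_now and soc > soc_min:
        return 1   # grid is about to get cleaner: discharge stored energy now
    return 2       # otherwise hold the current state of charge

# CI is forecast to rise and the battery has headroom, so the action is 0 (charge)
print(carbon_aware_battery_action(ci_now=300.0,
                                  ci_forecast=[320.0, 350.0, 360.0],
                                  soc=0.4))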
16.7 Training with the HARL Framework
SustainDC includes the Heterogeneous Agent RL (HARL) framework, which supports a wide range of multi-agent algorithms. Training is done through train_sustaindc.py.
16.7.1 Quick Start: HAPPO
HAPPO (Heterogeneous Agent PPO) is the recommended starting algorithm. It handles agents with different observation and action spaces natively:
uv run python train_sustaindc.py --algo happo --exp_name happo_ny

16.7.2 Other Supported Algorithms
| Algorithm | Type | Key Idea |
|---|---|---|
| IPPO | On-policy, independent | Each agent trains its own PPO; no shared information |
| MAPPO | On-policy, centralized critic | Shared critic sees all agents’ observations |
| HAPPO | On-policy, heterogeneous | Like MAPPO but handles different obs/action spaces |
| HAA2C | On-policy, heterogeneous | Advantage Actor-Critic variant of HAPPO |
| HAD3QN | Off-policy, heterogeneous | Dueling Double DQN for discrete actions |
| HASAC | Off-policy, heterogeneous | Soft Actor-Critic with entropy regularization |
| MADDPG | Off-policy, centralized critic | Multi-Agent DDPG (continuous actions) |
To try a different algorithm:
uv run python train_sustaindc.py --algo mappo --exp_name mappo_ny
uv run python train_sustaindc.py --algo ippo --exp_name ippo_ny

16.7.3 Configuration Files
The HARL framework loads configuration from YAML files in harl/configs/:
- Algorithm config: harl/configs/algos_cfgs/<algo>.yaml — learning rate, batch size, number of epochs, etc.
- Environment config: harl/configs/envs_cfgs/sustaindc.yaml — location, data files, reward settings, agent list
You can override any config parameter from the command line:
# Train in Arizona with a different reward weight
uv run python train_sustaindc.py --algo happo --exp_name happo_az \
--location AZ \
--cintensity_file "AZPS_NG_&_avgCI.csv" \
--weather_file USA_AZ_Davis-Monthan.AFB.722745_TMY3.epw \
--individual_reward_weight 0.5

16.7.4 Monitoring Training
Training logs are written to TensorBoard format:
uv run tensorboard --logdir results/sustaindc/

Key metrics to monitor:
- Episode reward (per agent): should increase over training
- Carbon footprint: should decrease
- Task queue length: for the LS agent, should stay low (tasks being completed on time)
16.8 Evaluation
16.8.1 The Five SustainDC Metrics
The paper defines five evaluation metrics (Section 5):
| Metric | Description | Lower is Better? |
|---|---|---|
| \(CO_2\) Footprint (\(CFP\)) | Cumulative carbon emissions: \(\sum_t (E_{hvac} + E_{it} + E_{bat}) \times CI_t\) | Yes |
| HVAC Energy | Total cooling energy (chiller + pumps + cooling tower + CRAH fan) | Yes |
| IT Energy | Total server energy consumption | Yes |
| Water Usage | Cooling tower water consumption | Yes |
| Task Queue | Accumulated deferred workload not completed within the horizon | Yes |
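The CFP metric is simply the grid energy drawn at each step multiplied by that step's carbon intensity, summed over the episode. A toy calculation with made-up numbers (energy in kWh, CI in gCO2/kWh):

import numpy as np

# Four timesteps of made-up per-step energy (kWh) and grid carbon intensity (gCO2/kWh)
e_hvac = np.array([120.0, 150.0, 180.0, 140.0])   # cooling energy
e_it   = np.array([400.0, 420.0, 450.0, 410.0])   # IT energy
e_bat  = np.array([ 50.0,   0.0, -60.0,   0.0])   # battery charging (+) / discharging (-)
ci     = np.array([300.0, 350.0, 420.0, 380.0])   # carbon intensity

cfp_g = np.sum((e_hvac + e_it + e_bat) * ci)      # cumulative footprint in gCO2
print(f"CO2 footprint: {cfp_g / 1e6:.2f} tCO2")   # 1 tCO2 = 1e6 gCO2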
16.8.2 Running Evaluation
Use the eval_sustaindc.py script to evaluate a trained checkpoint:
uv run python eval_sustaindc.py --algo happo --exp_name happo_ny

Or evaluate programmatically within a notebook:
from sustaindc_env import SustainDC, EnvConfig
import numpy as np
def evaluate_episode(env, policy_fn):
    """Run one full episode and collect metrics."""
    obs = env.reset()
    done = False
    metrics = {
        "carbon": [], "hvac_energy": [], "it_energy": [],
        "water": [], "rewards": {"agent_ls": 0, "agent_dc": 0, "agent_bat": 0},
    }
    while not done:
        actions = policy_fn(obs)
        obs, rewards, terminated, truncated, info = env.step(actions)

        dc_info = info["agent_dc"]
        metrics["carbon"].append(dc_info.get("bat_CO2_footprint", 0))
        metrics["hvac_energy"].append(dc_info.get("dc_HVAC_total_power_kW", 0))
        metrics["it_energy"].append(dc_info.get("dc_ITE_total_power_kW", 0))
        metrics["water"].append(dc_info.get("dc_water_usage", 0))

        for agent_id in metrics["rewards"]:
            if agent_id in rewards:
                metrics["rewards"][agent_id] += rewards[agent_id]

        done = terminated.get("__all__", False) or truncated.get("__all__", False)

    return {
        "total_carbon": metrics["carbon"][-1] if metrics["carbon"] else 0,
        "avg_hvac_kw": np.mean(metrics["hvac_energy"]),
        "avg_it_kw": np.mean(metrics["it_energy"]),
        "total_water": np.sum(metrics["water"]),
        "rewards": metrics["rewards"],
    }
# --- Baseline policy: do-nothing for each agent ---
DO_NOTHING = {"agent_ls": 1, "agent_dc": 1, "agent_bat": 2}
def baseline_policy(obs):
    return {agent: DO_NOTHING[agent] for agent in obs}
# --- Create environment ---
env_config = EnvConfig({
"agents": ["agent_ls", "agent_dc", "agent_bat"],
"month": 2,
"location": "NY",
"cintensity_file": "NYIS_NG_&_avgCI.csv",
"weather_file": "USA_NY_New.York-Kennedy.Intl.AP.744860_TMY3.epw",
"workload_file": "Alibaba_CPU_Data_Hourly_1.csv",
"datacenter_capacity_mw": 1.0,
"battery_capacity_mwh": 0.5,
"flexible_load": 0.1,
"individual_reward_weight": 0.8,
"ls_reward": "default_ls_reward",
"dc_reward": "default_dc_reward",
"bat_reward": "default_bat_reward",
})
env = SustainDC(env_config)
baseline_results = evaluate_episode(env, baseline_policy)
print("=== Baseline Results ===")
print(f" Carbon footprint: {baseline_results['total_carbon']:.2f} gCO2")
print(f" Avg HVAC power: {baseline_results['avg_hvac_kw']:.2f} kW")
print(f" Avg IT power: {baseline_results['avg_it_kw']:.2f} kW")
print(f" Total water: {baseline_results['total_water']:.2f} L")16.9 Reward Customization
One of SustainDC’s strengths is its flexible reward system. All reward functions are defined in utils/reward_creator.py.
16.9.1 Available Reward Functions
| Reward Function | Target Metric |
|---|---|
| default_dc_reward | \(-(E_{hvac} + E_{it}) \times CI_t\) (negative carbon footprint) |
| default_ls_reward | \(-(CFP_t + LS_{penalty})\) (carbon + task completion penalty) |
| default_bat_reward | \(-CFP_t\) (total carbon including battery) |
| tou_reward | Time-of-use electricity cost |
| energy_PUE_reward | Power Usage Effectiveness |
| water_usage_efficiency_reward | Cooling water consumption |
| temperature_efficiency_reward | Thermal constraint satisfaction |
16.9.2 Changing Reward Functions
To train with a different objective, simply change the reward strings in the config:
env_config_custom = EnvConfig({
"agents": ["agent_ls", "agent_dc", "agent_bat"],
"month": 6, # July — try a summer month for higher cooling loads
"location": "NY",
"cintensity_file": "NYIS_NG_&_avgCI.csv",
"weather_file": "USA_NY_New.York-Kennedy.Intl.AP.744860_TMY3.epw",
"workload_file": "Alibaba_CPU_Data_Hourly_1.csv",
"datacenter_capacity_mw": 1.0,
"battery_capacity_mwh": 0.5,
"flexible_load": 0.1,
# Use water usage as the DC reward instead of carbon
"dc_reward": "water_usage_efficiency_reward",
# Keep defaults for the other agents
"ls_reward": "default_ls_reward",
"bat_reward": "default_bat_reward",
})

16.9.3 Collaborative Reward Sharing (\(\alpha\))
The individual_reward_weight parameter controls \(\alpha\), the balance between an agent’s own reward and the rewards from other agents:
\[R_{DC} = \frac{(1-\alpha)}{2} \cdot r_{LS} + \alpha \cdot r_{DC} + \frac{(1-\alpha)}{2} \cdot r_{BAT}\]
- \(\alpha = 1.0\): fully independent — each agent only sees its own reward
- \(\alpha = 0.8\): default — 80% own reward, 10% from each other agent
- \(\alpha = 0.1\): highly collaborative — each agent mostly optimizes for the group
The paper (Section 6.2) shows that \(\alpha = 0.8\) outperforms both extremes, especially in partially observable settings where agents benefit from indirect feedback about how their actions affect other subsystems.
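A quick numeric check of the formula with made-up per-agent rewards:

def combined_reward(r_own, r_other_a, r_other_b, alpha=0.8):
    # alpha weight on the agent's own reward, (1 - alpha)/2 on each of the others
    return alpha * r_own + (1 - alpha) / 2 * (r_other_a + r_other_b)

r_ls, r_dc, r_bat = -1.0, -2.0, -0.5                  # illustrative rewards
print(combined_reward(r_dc, r_ls, r_bat, alpha=0.8))  # R_DC = -1.75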
# Ablation: try different alpha values
uv run python train_sustaindc.py --algo ippo --exp_name ippo_alpha10 --individual_reward_weight 1.0
uv run python train_sustaindc.py --algo ippo --exp_name ippo_alpha08 --individual_reward_weight 0.8
uv run python train_sustaindc.py --algo ippo --exp_name ippo_alpha01 --individual_reward_weight 0.1

16.10 Multi-Location Experiments
SustainDC ships with data for 8 US locations, chosen for their high data center density and diverse climates/grid mixes:
| Location | Climate | Grid Characteristics |
|---|---|---|
| AZ (Arizona) | Hot, arid | High solar penetration |
| CA (California) | Mediterranean | Mixed renewables, moderate CI |
| GA (Georgia) | Hot, humid | Coal + nuclear, higher CI |
| IL (Illinois) | Continental | Nuclear-heavy, lower CI |
| NY (New York) | Continental | Gas + nuclear, moderate CI |
| TX (Texas) | Hot, variable | Wind + gas, volatile CI |
| VA (Virginia) | Humid subtropical | Highest DC density in the US |
| WA (Washington) | Marine | Hydro-heavy, very low CI |
The data files are located in:
- data/Weather/ — EPW files (Typical Meteorological Year format)
- data/CarbonIntensity/ — hourly grid carbon intensity from the US EIA
Train the same HAPPO agent in Arizona (hot, high cooling load) and Washington (mild, low CI). Compare:
- Does the Arizona agent learn more aggressive cooling strategies?
- Does the Washington agent rely less on the battery (since the grid is already clean)?
- How do carbon footprints compare across locations?
uv run python train_sustaindc.py --algo happo --exp_name happo_az \
--location AZ \
--cintensity_file "AZPS_NG_&_avgCI.csv" \
--weather_file USA_AZ_Davis-Monthan.AFB.722745_TMY3.epw
uv run python train_sustaindc.py --algo happo --exp_name happo_wa \
--location WA \
--cintensity_file "WAAT_NG_&_avgCI.csv" \
--weather_file USA_WA_Port.Angeles-Fairchild.Intl.AP.727885_TMY3.epw

16.11 From PyDCM to SustainDC: What Changed?
If you completed the PyDCM tutorial, here is a summary of what SustainDC adds:
| Aspect | PyDCM (pydcm branch) | SustainDC (main branch) |
|---|---|---|
| Agents | Single (cooling only) | Three (workload + cooling + battery) |
| Training | Ray RLlib PPO | HARL framework (10+ algorithms) |
| Baselines | Fixed action | Rule-based controllers (ASHRAE 36, etc.) |
| Reward | Single carbon footprint | Per-agent rewards with collaborative sharing |
| Locations | 3 (NY, AZ, WA) | 8 US locations |
| Workload data | Alibaba only | Alibaba + Google traces |
| Observation space | 5–12 dimensions | 26+ dimensions (includes CI trends, forecasts) |
| Evaluation | Energy + carbon | 5 metrics (CFP, HVAC, IT, water, task queue) |
| Config | DCRL wrapper | SustainDC + EnvConfig with full YAML support |
The underlying thermal and HVAC models (the vectorized NumPy code we examined) are the same. The difference is in the orchestration layer and the breadth of the experimental framework.
16.12 Next Steps
- Assignment 3: Apply the Gnu-RL differentiable MPC policy (Paper 3) to the SustainDC DC cooling environment. Start in single-agent mode (agents: ["agent_dc"]) before attempting multi-agent.
- Algorithm comparison: Reproduce the paper's radar charts (Figure 5) by training IPPO, MAPPO, and HAPPO on the same location and comparing the five metrics.
- Custom rewards: Define a reward function in utils/reward_creator.py that balances carbon footprint and water usage, and observe how the learned policy changes. A starting-point sketch follows this list.
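As a starting point for the custom-rewards exercise, here is a sketch of what such a function could look like. It mirrors the params-dict style of the default rewards, but the specific dictionary keys and the 0.7/0.3 weights are assumptions made for illustration; check the existing functions in utils/reward_creator.py for the real parameter names before wiring it in.

# Hypothetical custom reward balancing carbon footprint and water usage.
# The params keys below are assumptions; match them to the names actually
# used by the reward functions in utils/reward_creator.py.
def carbon_water_reward(params: dict) -> float:
    carbon = params["total_energy_kwh"] * params["carbon_intensity"]  # assumed keys
    water = params["water_usage"]                                     # assumed key
    w_carbon, w_water = 0.7, 0.3                                      # illustrative weights
    # Less carbon and less water -> larger (less negative) reward
    return -(w_carbon * carbon + w_water * water)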