20  Can Attention Improve Sequence-to-Point Load Disaggregation?

20.1 Overview

Bouchur, M., Li, N., & Reinhardt, A. 2025. Can Attention Improve Sequence-to-Point Load Disaggregation? A Comparative Assessment. In The 12th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation (BuildSys ’25), November 19–21, 2025, Golden, CO, USA. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3736425.3772099

Application domain: Residential building energy management—specifically disaggregating per-appliance power consumption from a single whole-house meter (Non-Intrusive Load Monitoring, or NILM).

  • Sense: This is, perhaps, the primary innovation. The paper improves the sensing (and perception) pipeline by adding attention modules to the Sequence-to-Point (S2P) neural network used in processing power measurements from a building, enabling better extraction of appliance-specific power signatures from the aggregate mains signal. The input is aggregate active power measured at a single point (the mains); the output is appliance-level power estimates. One could consider this a plan/learn process, but that’s just a matter of taste.

  • Plan: Not a focus. Once appliance-level consumption is known, planning (e.g., demand response scheduling, energy audits) is enabled but not addressed here.

  • Act: Not addressed. No actuation or control loop is proposed.

20.2 Review of the paper

20.2.1 Summary

The paper addresses Non-Intrusive Load Monitoring (NILM) using the Sequence-to-Point (S2P) model—a CNN that takes a window of aggregate power readings and predicts the midpoint power of a single target appliance. Its central contribution is a systematic comparison of seven attention-based extensions from three families, inserted between the last convolutional layer and the dense head of the S2P architecture.

Key contributions and findings:

  1. Seven attention variants from three families: Channel Attention (CA), Feed-Forward Attention (FFA), and Self-Attention (SA), with two or three sub-variants per family.

  2. Context-dependent performance: The best attention variant depends on both appliance type and dataset—there is no one-size-fits-all winner.

  3. Appliance-specific findings on UK-DALE:

    • For transient, high-power loads (microwave, kettle): SA+MLP+PE (Transformer-style self-attention with positional encoding) works best, achieving \(-15\%\) MAE on average and \(-45\%\) on the microwave.
    • For steady/periodic loads (fridge): SA+MLP or CA variants perform better.
  4. Dataset-specific findings: On REDD, channel attention (CA+SpA) generalizes best on average (\(-11\%\) MAE).

  5. Cross-dataset transfer: FFA generalizes best UK-DALE\(\rightarrow\)REDD; CA+SpA best REDD\(\rightarrow\)UK-DALE.

  6. Resource tradeoffs: SA variants require \(\sim2\times\) GFLOPs and \(\sim4\times\) training time; CA adds only \(\sim4\%\) overhead.

  7. Practical recommendation: Attention selection should be treated as a domain-sensitive hyperparameter, not a universal improvement.

20.2.2 What do we know already?

This paper connects directly to the AC power systems content from Lecture 8, particularly the sections on power measurement, VI trajectories, and load identification. The table below maps each key concept from the paper to its origin in our course material:

| Paper Concept | Lecture 8 Source | Connection |
|---|---|---|
| Aggregate power signal (mains) | Instantaneous power: \(p[n] = v[n] \cdot i[n]\) | The S2P model’s input is exactly the aggregate active power time series that you learned to compute from raw V/I samples |
| Per-appliance power signatures | VI trajectories as load fingerprints | The paper’s goal (disaggregation) is the ML generalization of the VI-based load identification concept from Lecture 8 |
| Appliance characteristics (transient vs. steady-state) | RLC transient behavior and load types | The paper’s finding that attention type depends on appliance dynamics (sharp transients vs. long plateaus) maps directly to the resistive vs. reactive vs. non-linear load taxonomy |
| Sampling rate and window size | Sub-cycle vs. cycle-level measurement | The paper uses 6s sampling / 599-sample windows (\(\sim1\) hour); Lecture 8 discussed tradeoffs between temporal resolution and noise averaging |
| NILM concept | VI trajectory section on load identification | Lecture 8 introduced NILM as a motivation for understanding VI trajectories; this paper is a state-of-the-art NILM method |

Key extension beyond Lecture 8: Lecture 8 introduced NILM via physics-based VI trajectories—single-cycle fingerprints compared via pattern matching. This paper shows how deep learning (specifically attention-augmented CNNs) can learn to disaggregate directly from low-rate active power time series without sub-cycle V/I waveforms, capturing temporal patterns across minutes rather than within a single AC cycle (\(\sim\)16.7 ms at 60 Hz, 20 ms at 50 Hz). Where VI trajectories require high-frequency sampling (kHz), S2P operates on 6-second intervals and relies on learned temporal features rather than physics-derived signatures.

Tip: Things to learn more about

To fully understand this paper’s contributions, you may need background on:

  1. The Transformer architecture and attention mechanisms—the core ML concept underlying the SA variants (covered in Section 20.3.2 below)
  2. Convolutional Neural Networks (CNNs) for time series—the baseline S2P model uses 1D convolutions to extract features from the aggregate power sequence
  3. Sequence-to-Point (S2P) learning—the specific framing where a window of input maps to a single midpoint output value
  4. Squeeze-and-Excitation (SE) networks—the basis for Channel Attention, where convolutional filter outputs are reweighted by learned importance scores
  5. Positional encoding—how sequence order information is injected into attention models that are otherwise permutation-invariant
  6. UK-DALE and REDD datasets—standard NILM benchmarks used for evaluation

20.3 Methods

20.3.1 The Sequence-to-Point (S2P) Baseline

The S2P model (Zhang et al. 2018) is a CNN designed specifically for NILM. Its architecture is straightforward:

  1. Input: A window of \(W = 599\) aggregate power samples at 6-second resolution (\(\sim1\) hour of data)
  2. Feature extraction: 5 successive 1D convolutional layers with increasing filter counts (30, 30, 40, 50, 50) and kernel size 10
  3. Prediction head: Flatten \(\rightarrow\) Dense(1024) \(\rightarrow\) Dense(1) \(\rightarrow\) single scalar output
  4. Output: The predicted power consumption of a single target appliance at the midpoint of the input window
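
To make the shapes concrete, here is a minimal PyTorch sketch of this architecture. Layer sizes follow the description above; the ReLU activations, the unpadded (“valid”) convolutions, and every other detail are illustrative assumptions rather than the paper’s exact configuration.

```python
import torch
import torch.nn as nn

class S2P(nn.Module):
    """Sketch of the Sequence-to-Point baseline (Zhang et al. 2018)."""

    def __init__(self, window: int = 599):
        super().__init__()
        channels = [1, 30, 30, 40, 50, 50]       # input + 5 conv layers
        convs = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            convs += [nn.Conv1d(c_in, c_out, kernel_size=10), nn.ReLU()]
        self.convs = nn.Sequential(*convs)
        t_out = window - 5 * (10 - 1)            # each valid conv shortens T by 9
        self.head = nn.Sequential(
            nn.Flatten(),                        # (B, 50, T) -> (B, 50*T)
            nn.Linear(50 * t_out, 1024),
            nn.ReLU(),
            nn.Linear(1024, 1),                  # midpoint power of the target appliance
        )

    def forward(self, x):                        # x: (B, 1, 599) aggregate power window
        return self.head(self.convs(x))          # (B, 1)

model = S2P()
y = model(torch.randn(8, 1, 599))                # -> torch.Size([8, 1])
```

One model instance like this is trained per target appliance, as noted below.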

Connection to Lecture 8: The input sequence is exactly the active power time series \(p[n]\) from the power measurement section—except measured at the whole-house mains rather than at individual appliances, and sampled at low rate (6s) rather than sub-cycle.

A separate S2P model is trained for each target appliance. The convolutional layers learn to detect temporal patterns in the aggregate signal that correspond to the target appliance’s operation—effectively learning a data-driven version of the “load fingerprinting” concept from Lecture 8’s VI trajectory discussion.

20.3.2 Attention Mechanisms: Intuition

The core idea behind attention is simple: not all parts of the input are equally important for the prediction. Attention lets the model learn where to focus.

Consider disaggregating a microwave from the aggregate signal. A microwave has a distinctive sharp on/off pattern—a sudden jump to \(\sim1200\)W followed by a plateau and then a sharp drop. The S2P model should focus on these transitions and ignore the background steady-state. Without attention, the dense layers must learn this selectivity implicitly from the flat feature vector. With attention, the model can explicitly learn to weight the relevant time steps (or feature channels) more heavily before the features reach the dense head.

This intuition maps directly to a finding in the paper: SA+MLP+PE (which can learn to attend to specific temporal positions) gives the largest improvement for transient-dominated appliances like the microwave (\(-45\%\) MAE on UK-DALE).

20.3.3 The Transformer Architecture and Self-Attention

The Self-Attention (SA) variants in this paper are derived from the Transformer architecture (Vaswani et al. 2017), originally developed for machine translation and now the foundation of modern large language models. Understanding how self-attention works is key to understanding the paper’s most powerful (and most expensive) attention family.

20.3.3.1 Query, Key, Value: The Core Mechanism

Self-attention operates on a sequence of input vectors \(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\). For each position \(i\), it computes three vectors via learned linear projections:

  • Query \(\mathbf{q}_i = W_Q \mathbf{x}_i\): “What am I looking for?”
  • Key \(\mathbf{k}_j = W_K \mathbf{x}_j\): “What do I contain?”
  • Value \(\mathbf{v}_j = W_V \mathbf{x}_j\): “What information do I provide?”

The output for position \(i\) is a weighted sum of all value vectors, where the weights are determined by how well each key matches the query:

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V\]

where:

  • \(QK^T\) computes the dot product between every query-key pair (an \(n \times n\) attention matrix)
  • \(\sqrt{d_k}\) is a scaling factor to prevent the dot products from becoming too large (which would push softmax into saturation)
  • softmax normalizes each row so the attention weights sum to 1

Intuition for NILM: In the S2P context, each position in the sequence corresponds to a time step’s feature representation after the convolutional layers. Self-attention allows every time step to “look at” every other time step and decide which ones are relevant. For a microwave disaggregation, the time steps near the on/off transitions would learn to attend to each other, reinforcing the sharp-edge pattern.
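
The equation above is only a few lines of code. This sketch assumes illustrative dimensions (\(n\) positions, \(d\)-dimensional features, \(d_k\)-dimensional projections); it is not the paper’s implementation.

```python
import torch
import torch.nn.functional as F

def self_attention(x, W_Q, W_K, W_V):
    """x: (n, d) sequence of feature vectors, one per position."""
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V          # learned linear projections
    d_k = Q.shape[-1]
    scores = Q @ K.T / d_k ** 0.5                # (n, n) attention matrix
    weights = F.softmax(scores, dim=-1)          # each row sums to 1
    return weights @ V                           # weighted sum of value vectors

n, d, d_k = 554, 50, 64                          # illustrative sizes
x = torch.randn(n, d)
W_Q, W_K, W_V = (torch.randn(d, d_k) for _ in range(3))
out = self_attention(x, W_Q, W_K, W_V)           # -> shape (554, 64)
```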

20.3.3.2 Multi-Head Attention

Rather than computing a single attention function, the Transformer runs multiple attention heads in parallel, each with its own \(W_Q\), \(W_K\), \(W_V\) projections:

\[\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W_O\]

where each \(\text{head}_i = \text{Attention}(Q W_Q^i, K W_K^i, V W_V^i)\).

Each head can learn to attend to different aspects of the input—one head might focus on sharp transients, another on periodic patterns, and another on baseline power levels.
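
In code, multi-head attention is a single call to PyTorch’s built-in module; the embedding size and head count here are illustrative assumptions:

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
x = torch.randn(8, 554, 64)          # (batch, time positions, features)
out, weights = mha(x, x, x)          # self-attention: query = key = value = x
print(out.shape)                     # torch.Size([8, 554, 64])
print(weights.shape)                 # torch.Size([8, 554, 554]), averaged over heads
```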

20.3.3.3 Positional Encoding

Self-attention is permutation-equivariant: shuffling the input sequence simply shuffles the outputs in the same way, because attention compares vector contents, not positions; the model has no built-in notion of order. For time-series data where ordering matters, this is a problem. Positional encoding solves it by adding position-dependent signals to the input:

\[PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \quad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)\]

These sinusoidal functions give each position a unique signature and allow the model to learn relative position relationships.
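
The formula translates directly to code. A sketch, assuming an even feature dimension \(d\):

```python
import torch

def positional_encoding(n_pos: int, d: int) -> torch.Tensor:
    """Sinusoidal encoding (Vaswani et al. 2017): one d-vector per position."""
    pos = torch.arange(n_pos, dtype=torch.float32).unsqueeze(1)  # (n_pos, 1)
    i = torch.arange(0, d, 2, dtype=torch.float32)               # even indices 2i
    angle = pos / 10000 ** (i / d)                               # (n_pos, d/2)
    pe = torch.zeros(n_pos, d)
    pe[:, 0::2] = torch.sin(angle)                               # even dimensions: sin
    pe[:, 1::2] = torch.cos(angle)                               # odd dimensions: cos
    return pe                                                    # used as x + pe

pe = positional_encoding(554, 64)    # one unique signature per time position
```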

Connection to the paper: The SA+MLP+PE variant includes positional encoding, while SA and SA+MLP do not. The paper finds that PE provides a significant boost for transient appliances (microwave, kettle), which makes sense: knowing when in the window a sharp transition occurs is critical for identifying these appliances.

20.3.3.4 The Transformer Block

A complete Transformer encoder block combines these elements:

  1. Multi-Head Self-Attention
  2. Add & Layer Normalize (residual connection)
  3. Feed-Forward Network (two dense layers with nonlinearity)
  4. Add & Layer Normalize (residual connection)

Connection to the paper: The SA+MLP+PE variant (Fig. 2g in the paper) is essentially one Transformer encoder block applied to the S2P feature maps after the convolutional layers. It is the most powerful but also the most expensive variant (\(\sim 2 \times\) GFLOPs, \(\sim 4 \times\) training time compared to the baseline).
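
Put together, one encoder block looks roughly like the sketch below; the dimensions and head count are illustrative assumptions, not the paper’s exact configuration.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d: int = 64, heads: int = 4, d_ff: int = 256):
        super().__init__()
        self.mha = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))

    def forward(self, x):                # x: (batch, time, d)
        a, _ = self.mha(x, x, x)         # 1. multi-head self-attention
        x = self.norm1(x + a)            # 2. add & layer-normalize (residual)
        x = self.norm2(x + self.ffn(x))  # 3.+4. feed-forward, then add & normalize
        return x
```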

20.3.4 The Three Attention Families in the Paper

The paper evaluates seven attention variants organized into three families. All are inserted at the same point in the S2P architecture: between the last convolutional layer and the dense head.

To build intuition for how each family works, it helps to think about the shape of the tensor entering the attention module. After 5 convolutional layers, the original 599-sample input has been transformed into a 2D feature map with two axes:

  • \(C\) channels (= 50, one per convolutional filter in the last layer): each channel captures a different learned pattern (e.g., sharp edges, slow ramps, periodic oscillations)
  • \(T\) time positions (roughly the length of the input, reduced slightly by each conv layer): each position corresponds to a local region of the original power time series

So the attention input is a \((C \times T)\) matrix per sample—50 feature channels, each containing a time series of activations. The three attention families differ in which axis they attend to and how they compute importance weights.

20.3.4.1 Channel Attention (CA)

Based on Squeeze-and-Excitation (SE) networks (Hu et al. 2018). CA asks: “Which of the \(C\) filters are most important?” It collapses the time axis entirely and produces a per-channel weight:

  1. Squeeze (global average pool over time): Each channel’s \(T\)-long activation is averaged to a single scalar, reducing the \((C \times T)\) feature map to a \(C\)-dimensional vector—a compact “channel descriptor.”
  2. Excitation (MLP + sigmoid): A small bottleneck MLP (\(C \rightarrow C/r \rightarrow C\) with reduction ratio \(r\)) learns inter-channel dependencies and outputs \(C\) importance weights in \([0, 1]\) via sigmoid.
  3. Scale (broadcast multiply): Each channel’s entire time series is multiplied by its scalar weight. Output shape is the same \((C \times T)\), but channels have been reweighted.

The key property: CA is blind to temporal position. It can amplify a channel that captures “sharp on/off edges” and suppress one that captures “slow drift,” but it cannot say “focus on time step 300.” This is also why it’s so cheap—the MLP operates on a 50-dimensional vector, not the full \(C \times T\) tensor.
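
The three steps map onto a few lines of PyTorch. This is a sketch of an SE-style block as described above; the reduction ratio \(r\) is an assumed hyperparameter.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, c: int = 50, r: int = 4):
        super().__init__()
        self.excite = nn.Sequential(             # bottleneck MLP: C -> C/r -> C
            nn.Linear(c, c // r), nn.ReLU(),
            nn.Linear(c // r, c), nn.Sigmoid(),  # C importance weights in [0, 1]
        )

    def forward(self, x):                        # x: (batch, C, T) feature map
        z = x.mean(dim=-1)                       # squeeze: average over time -> (B, C)
        w = self.excite(z)                       # excitation: inter-channel dependencies
        return x * w.unsqueeze(-1)               # scale: broadcast over the time axis
```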

Two variants:

| Variant | Description | Key Feature |
|---|---|---|
| CA | Standard SE block | Reweights filters globally |
| CA+SpA | SE + spatial (position-wise) attention | Also reweights time positions |

CA+SpA adds a second stage that asks “Which time positions matter?” After the channel reweighting, it pools across channels (not time) at each position to get a \(T\)-dimensional descriptor, applies a 1D convolution, and produces a per-position weight in \([0, 1]\). The result is a tensor reweighted along both axes: important channels and important time positions are amplified.
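
A sketch of that second, spatial stage (the 1D convolution’s kernel size is an assumption):

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel: int = 7):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel, padding=kernel // 2)

    def forward(self, x):                # x: (batch, C, T), already channel-reweighted
        z = x.mean(dim=1, keepdim=True)  # pool across channels -> (B, 1, T)
        w = torch.sigmoid(self.conv(z))  # per-position weight in [0, 1]
        return x * w                     # broadcast over the channel axis
```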

Overhead: Minimal—only \(\sim4\%\) additional parameters and GFLOPs. This makes CA attractive when computational budget is limited.

20.3.4.2 Feed-Forward Attention (FFA)

Based on Bahdanau attention (Bahdanau et al. 2015). Where CA attends along the channel axis, FFA attends along the time axis: it asks “How important is each time position?”

  1. Position-wise MLP: A two-layer MLP is applied independently to each time position’s \(C\)-dimensional feature vector, producing a single scalar score per position. Think of it as sliding the same small network across all \(T\) positions.
  2. tanh activation: The scores pass through tanh, giving values in \([-1, 1]\) (so FFA can actually flip the sign of a position’s contribution, not just suppress it).
  3. Reweight (broadcast multiply): Each position’s entire \(C\)-dimensional feature vector is scaled by its score. Output shape is again \((C \times T)\).

The critical difference from CA: FFA can focus on “the moment the microwave turns on” (a specific time position), while CA cannot. However, each position is scored independently—FFA does not model interactions between positions. It can say “time step 300 is important” but not “time step 300 is important because of what happened at time step 250.”
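
As a sketch, the position-wise scoring is one small MLP shared across all \(T\) positions; the hidden width and its nonlinearity are assumptions.

```python
import torch
import torch.nn as nn

class FeedForwardAttention(nn.Module):
    def __init__(self, c: int = 50, hidden: int = 64):
        super().__init__()
        self.score = nn.Sequential(              # the same MLP slides over all positions
            nn.Linear(c, hidden), nn.Tanh(),
            nn.Linear(hidden, 1), nn.Tanh(),     # final score per position, in [-1, 1]
        )

    def forward(self, x):                        # x: (batch, C, T) feature map
        s = self.score(x.transpose(1, 2))        # (B, T, 1): one scalar per position
        return x * s.transpose(1, 2)             # rescale every position's feature vector
```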

Two variants:

| Variant | Description | Key Feature |
|---|---|---|
| FFA | MLP + tanh \(\rightarrow\) position weights | Lightweight position weighting |
| FFA+LSTM | FFA preceded by a bidirectional LSTM | Adds sequential memory before attention |

The LSTM variant addresses FFA’s independence limitation by first passing the features through a bidirectional LSTM, which injects sequential context before the attention scoring.

20.3.4.3 Self-Attention (SA)

Based on the Transformer architecture (Vaswani et al. 2017). SA fills the gap that FFA leaves: it computes pairwise interactions between all \(T\) time positions, producing a \(T \times T\) attention matrix where entry \((i, j)\) represents “how much should position \(i\) attend to position \(j\).” This means SA can learn relationships like “this time step matters because of its relationship to that other time step”—but at the cost of computing and storing that full \(T \times T\) matrix.

Three variants:

| Variant | Description | Key Feature |
|---|---|---|
| SA | Raw self-attention only | Minimal SA, no feed-forward |
| SA+MLP | SA + feed-forward network | Adds nonlinear transformation |
| SA+MLP+PE | SA + MLP + positional encoding | Full Transformer encoder block |

SA is the most powerful family (it can model arbitrary position-to-position dependencies) but also the most expensive (\(\sim2\times\) GFLOPs, \(\sim4\times\) training time). The \(T \times T\) attention matrix is the main source of this cost.

Summary of what each family can “see”:

| Family | Attends along | Can distinguish time positions? | Models position-to-position interactions? |
|---|---|---|---|
| CA | Channels | No | No |
| FFA | Time | Yes (independently) | No |
| SA | Time | Yes (jointly) | Yes (\(T \times T\) attention matrix) |

20.3.5 Key Results Summary

The paper’s main finding is that no single attention variant dominates across all settings. Performance depends on both the appliance and the dataset:

Best variants by appliance type (UK-DALE):

| Appliance | Best Attention | MAE Change | Why It Works |
|---|---|---|---|
| Microwave | SA+MLP+PE | \(-45\%\) | Sharp on/off transients benefit from positional and pairwise attention |
| Kettle | SA+MLP+PE | \(-22\%\) | Similar transient pattern to the microwave |
| Washing machine | CA+SpA | \(-15\%\) | Multi-phase operation benefits from spatial + channel reweighting |
| Fridge | SA+MLP | \(-8\%\) | Periodic steady-state pattern; PE not needed |
| Dishwasher | FFA | \(-5\%\) | Long, complex cycles; lightweight attention suffices |

Cross-dataset generalization:

| Transfer Direction | Best Attention | Average MAE Change |
|---|---|---|
| UK-DALE \(\rightarrow\) REDD | FFA | Best generalization |
| REDD \(\rightarrow\) UK-DALE | CA+SpA | Best generalization |
| Within REDD | CA+SpA | \(-11\%\) average |

Resource tradeoffs:

| Family | Additional GFLOPs | Training Time Increase | Parameters Added |
|---|---|---|---|
| CA | \(\sim4\%\) | Minimal | \(\sim4\%\) |
| FFA | \(\sim10\%\) | \(\sim1.5\times\) | \(\sim8\%\) |
| SA | \(\sim100\%\) (\(2\times\)) | \(\sim4\times\) | \(\sim30\) to \(50\%\) |

Practical takeaway: Attention type should be treated as a domain-sensitive hyperparameter. If computational budget is limited, CA variants offer the best cost-performance ratio. If maximum accuracy on transient appliances is the goal, SA+MLP+PE is worth the compute cost.

20.4 Additional Resources

For those of you who want to go deeper into the ML concepts underlying this paper: