4 Module 3: Processing Time Series Data
4.1 Module Overview
4.1.1 Learning Objectives
By the end of this module, students will be able to:
- Calculate error propagation for linear Gaussian models
- Explain why error propagation differs when models are not linear
- Conduct outlier removal in time series using Chauvenet’s criterion
- Follow the derivation of linear regression with linear and non-linear basis functions
- Use linear regression to interpolate or extrapolate from the available time-series data (estimating unobserved values in the time domain)
4.1.2 Topics Covered
- Error propagation for linear Gaussian models
- Non-linear models and error propagation
- Outlier removal using Chauvenet’s criterion
- Linear regression derivation (linear and non-linear basis functions)
- Time series interpolation and extrapolation
4.1.3 Project Milestones
Post your group assignment for the final project on Piazza, or indicate that you need to join a group.
4.2 Source Material
4.2.1 Recap: Uncertainty of Measurements Under Noise
In Module 2, we introduced the concept that real-world sensors produce measurements corrupted by noise. Let’s revisit this fundamental idea and make it more concrete, as understanding measurement uncertainty is critical for everything we’ll do in processing time series data.
4.2.1.1 The Measurement Model
Imagine we want to measure the temperature in a room using a digital thermometer. The true temperature (which we’ll never know exactly) is some value \(q\). When we take a measurement with our sensor, we get a reading \(y\). If our sensor were perfect, we’d have \(y = q\). But real sensors aren’t perfect—they’re affected by electronic noise, environmental factors, and inherent measurement limitations.
We can model this imperfection mathematically as:
\[y = q + n\]
where:
- \(q\) is the true physical quantity we’re trying to measure
- \(y\) is our actual measurement (what the sensor reports)
- \(n\) is the measurement noise
For many practical sensors, this noise \(n\) can be well-approximated as additive white Gaussian noise. This means:
- The noise is additive (it simply adds to the true value)
- It’s white (noise at one time point doesn’t depend on noise at other time points)
- It follows a Gaussian (Normal) distribution with mean zero and some variance \(\sigma^2\)
We write this as: \(n \sim \mathcal{N}(0, \sigma^2)\)
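To make the model concrete, here is a minimal simulation of this measurement process (the values of \(q\) and \(\sigma\) are illustrative, not from any particular sensor):
Code
import numpy as np

rng = np.random.default_rng(0)

q = 21.3       # true temperature in Celsius (unknown in practice; illustrative)
sigma = 0.4    # sensor noise standard deviation (illustrative)

# Each reading is the true value plus an independent Gaussian noise sample
y = q + rng.normal(0, sigma, size=5)
print(y)       # five noisy readings scattered around 21.3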
4.2.1.2 Why Gaussian Noise?
You might wonder: why do we assume Gaussian noise? While not all noise is perfectly Gaussian, there are good reasons this assumption is both practical and theoretically justified:
- Central Limit Theorem: When noise comes from many small, independent sources (thermal noise, quantization errors, electromagnetic interference, etc.), their sum tends toward a Gaussian distribution
- Mathematical tractability: Gaussian distributions have nice mathematical properties that make analysis possible
- Empirical observation: Many real sensors do exhibit approximately Gaussian noise
4.2.1.3 Understanding Uncertainty
Now, what does this noise model tell us about uncertainty? There are two perspectives we might care about:
Forward uncertainty - \(P(y|q)\): “Given that the true value is \(q\), what measurement \(y\) might I observe?”
Since \(y = q + n\) and \(n \sim \mathcal{N}(0, \sigma^2)\), we have:
\[P(y|q) = \mathcal{N}(q, \sigma^2) \tag{4.1}\]
This says: if the true value is \(q\), our measurements will be normally distributed around \(q\) with variance \(\sigma^2\).
Inverse uncertainty - \(P(q|y)\): “Given that I observed measurement \(y\), what might the true value \(q\) be?”
This is actually what we care about most! There are many ways to arrive at this quantity. A simple one is to rearrange our \(y = q + n\) equation as \(q = y - n\). Since \(-n\) is Gaussian with zero mean and variance \(\sigma^2\) (just like \(n\)), the true value \(q\) given a measurement \(y\) follows a normal distribution with mean \(y\) and variance \(\sigma^2\), mirroring Equation 4.1:
\[P(q|y) = \mathcal{N}(y, \sigma^2)\]
This result shows that both distributions have the same form and the same variance. Given a measurement \(y\), our best estimate of the true value \(q\) is simply \(y\) itself, but we remain uncertain by an amount characterized by \(\sigma^2\).
4.2.1.4 Estimating the Noise Level
How do we determine \(\sigma^2\) in practice? If we can take multiple measurements \(y_1, y_2, \ldots, y_N\) of the same physical quantity \(q\) (which remains constant), we can estimate the variance from the sample variance:
\[\hat{\sigma}^2 = \frac{1}{N-1} \sum_{i=1}^{N} (y_i - \bar{y})^2\]
where \(\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i\) is the sample mean.
This gives us a practical way to characterize our sensor’s uncertainty: collect repeated measurements under stable conditions and compute the sample variance.
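A minimal sketch of this procedure (the true value and noise level are again illustrative; in practice you would use logged readings taken under stable conditions):
Code
import numpy as np

rng = np.random.default_rng(1)

# Simulate 200 repeated readings of a constant quantity
y = 21.3 + rng.normal(0, 0.4, size=200)

y_bar = np.mean(y)               # sample mean (estimates q)
sigma2_hat = np.var(y, ddof=1)   # sample variance with Bessel's correction (N-1)

print(f"estimate of q: {y_bar:.3f}")
print(f"estimate of sigma^2: {sigma2_hat:.4f}")  # close to 0.4**2 = 0.16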
4.2.2 Error Propagation for Virtual Sensors
In Module 2, we introduced the concept of virtual sensors—measurements that are computed from other sensor measurements rather than directly measured. For example, if we measure the water levels at the top and bottom of a dam, we might compute the pressure difference as a virtual sensor. A key question arises: if our physical sensors have uncertainty, how does that uncertainty propagate to our virtual sensors?
4.2.2.1 A Motivating Example: Averaging Two Measurements
Let’s start with the simplest possible case. Suppose we have two sensors measuring the same physical quantity \(q\):
\[y_1 = q + n_1, \quad n_1 \sim \mathcal{N}(0, \sigma_1^2)\] \[y_2 = q + n_2, \quad n_2 \sim \mathcal{N}(0, \sigma_2^2)\]
where \(n_1\) and \(n_2\) are independent (noise in one sensor doesn’t affect the other).
To get a better estimate, we might average the two measurements to create a virtual sensor:
\[z = \frac{1}{2}y_1 + \frac{1}{2}y_2\]
Question: What is the noise affecting \(z\)? Is it better or worse than using just one sensor?
Let’s work this out. Substituting our measurement model:
\[z = \frac{1}{2}(q + n_1) + \frac{1}{2}(q + n_2) = q + \frac{1}{2}n_1 + \frac{1}{2}n_2\]
So the virtual sensor has noise \(n_z = \frac{1}{2}n_1 + \frac{1}{2}n_2\). What are the properties of this noise?
Mean: Since both \(n_1\) and \(n_2\) have zero mean: \[E[n_z] = \frac{1}{2}E[n_1] + \frac{1}{2}E[n_2] = 0\]
Great! The virtual sensor is still unbiased (zero mean noise).
Variance: Using the independence of \(n_1\) and \(n_2\): \[\text{Var}(n_z) = \text{Var}\left(\frac{1}{2}n_1 + \frac{1}{2}n_2\right) = \frac{1}{4}\text{Var}(n_1) + \frac{1}{4}\text{Var}(n_2) = \frac{1}{4}\sigma_1^2 + \frac{1}{4}\sigma_2^2\]
Special case: If both sensors have the same noise level (\(\sigma_1 = \sigma_2 = \sigma\)), then: \[\text{Var}(n_z) = \frac{1}{4}\sigma^2 + \frac{1}{4}\sigma^2 = \frac{1}{2}\sigma^2\]
By averaging two independent measurements, we’ve reduced the variance by half. The standard deviation (the square root of the variance) shrinks by a factor of \(\sqrt{2}\), i.e., it is multiplied by \(1/\sqrt{2} \approx 0.707\). This is why taking multiple measurements and averaging them improves accuracy.
Distribution: Since \(n_1\) and \(n_2\) are Gaussian and we’re taking a linear combination, \(n_z\) is also Gaussian (this is a key property of Gaussian distributions). Therefore:
\[z = q + n_z, \quad n_z \sim \mathcal{N}\left(0, \frac{\sigma^2}{2}\right)\]
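A quick Monte Carlo check confirms the variance halving (the common noise level \(\sigma\) here is illustrative):
Code
import numpy as np

rng = np.random.default_rng(2)

sigma = 1.0                               # illustrative common noise level
n1 = rng.normal(0, sigma, 100_000)        # noise of sensor 1
n2 = rng.normal(0, sigma, 100_000)        # noise of sensor 2 (independent)

n_z = 0.5 * n1 + 0.5 * n2                 # noise of the averaged virtual sensor

print(np.var(n1))    # ~1.0  (single sensor: sigma^2)
print(np.var(n_z))   # ~0.5  (average: sigma^2 / 2)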
4.2.2.2 The General Linear Case
Now let’s generalize. Suppose we have \(m\) sensors measuring different (or the same) physical quantities:
\[y_k = q_k + n_k, \quad n_k \sim \mathcal{N}(0, \sigma_k^2), \quad k = 1, 2, \ldots, m\]
And we create a virtual sensor as a linear combination of these measurements:
\[z_i = \sum_{k=1}^{m} \alpha_k y_{k,i} + c\]
where \(\alpha_k\) are weighting coefficients and \(c\) is a constant offset.
Following the same logic as before:
\[z_i = \sum_{k=1}^{m} \alpha_k (q_k + n_k) + c = \left(\sum_{k=1}^{m} \alpha_k q_k + c\right) + \sum_{k=1}^{m} \alpha_k n_k\]
The virtual sensor has noise: \[n_z = \sum_{k=1}^{m} \alpha_k n_k\]
Mean: Still zero (since each \(n_k\) has zero mean)
Variance: Assuming independent noise across sensors: \[\text{Var}(n_z) = \sum_{k=1}^{m} \alpha_k^2 \sigma_k^2\]
Distribution: \(n_z \sim \mathcal{N}\left(0, \sum_{k=1}^{m} \alpha_k^2 \sigma_k^2\right)\)
Key insight: The variance of the virtual sensor depends on:
- The variances of the input sensors (\(\sigma_k^2\))
- The squares of the coefficients (\(\alpha_k^2\))
This means:
- Multiplying a measurement by a factor \(\alpha\) multiplies its variance by \(\alpha^2\)
- Adding independent noisy measurements adds their variances
- The constant \(c\) doesn’t affect uncertainty at all (it’s just a bias)
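These rules are easy to package into a small helper; a minimal sketch (the coefficients and noise levels in the example call are illustrative):
Code
import numpy as np

def propagate_linear(alphas, sigmas):
    """Std dev of z = sum_k alpha_k * y_k + c, for independent sensor noise.

    The constant offset c does not appear: it has no effect on the variance.
    """
    alphas = np.asarray(alphas, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    return np.sqrt(np.sum(alphas**2 * sigmas**2))

# Illustrative example: z = 2*y1 - 0.5*y2 + 3, with sigma1 = 0.3, sigma2 = 0.8
print(propagate_linear([2.0, -0.5], [0.3, 0.8]))  # ~0.72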
4.2.2.3 Interactive Exploration: Virtual Sensors in Action
Let’s make this concrete with an interactive example. Imagine we have two sensors measuring water levels at different locations in a reservoir system:
- Sensor 1 measures the upper level: \(y_{1,k} = q_U + n_{1,k}\)
- Sensor 2 measures the lower level: \(y_{2,k} = q_L + n_{2,k}\)
From these, we create two virtual sensors:
- Average level: \(z_A = \frac{1}{2}y_1 + \frac{1}{2}y_2\) (measures \(q_A = \frac{1}{2}q_U + \frac{1}{2}q_L\))
- Gradient (flow rate proxy): \(z_C = \frac{1}{h}y_1 - \frac{1}{h}y_2\) where \(h\) is the vertical distance between sensors
Use the interactive visualization below to explore how changing the true levels (\(q_U\), \(q_L\)) and noise levels (\(\sigma_U\), \(\sigma_L\)) affects both the raw measurements and the virtual sensors:
Note
The following activity needs to be completed in an interactive version of this document (i.e., the HTML output served by a webserver).
Experiment with the controls:
- Try increasing both \(\sigma_U\) and \(\sigma_L\)—watch how noise increases in all sensors
- Set \(q_U = q_L\)—the gradient should be near zero, but notice the noise!
- Decrease \(h\) (sensors closer together)—the gradient uncertainty explodes!
- Compare the scatter in the average plot versus the individual sensor plots
This interactive example illustrates a crucial engineering trade-off: virtual sensors that amplify measurements (large coefficients) also amplify noise.
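The explosion of the gradient’s uncertainty as \(h\) shrinks follows directly from the linear error propagation rule: the coefficients are \(\pm 1/h\), so \(\text{Var}(z_C) = (\sigma_U^2 + \sigma_L^2)/h^2\). A quick check with illustrative noise levels:
Code
import numpy as np

sigma_U = sigma_L = 150.0                  # illustrative sensor noise levels

for h in [0.5, 0.1, 0.05]:                 # vertical distance between the sensors
    std_gradient = np.sqrt(sigma_U**2 + sigma_L**2) / h
    print(f"h = {h:4.2f}: gradient std = {std_gradient:7.1f}")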
4.2.2.4 When Things Go Non-Linear: Breaking the Gaussian Assumption
Everything we’ve discussed so far relies on a critical assumption: the virtual sensor is a linear function of the measurements. But what happens when the relationship is non-linear?
Consider a practical example: measuring power dissipation in a resistor. If we measure current \(I\) flowing through a resistor with resistance \(R\):
\[I_{measured} = I_{true} + n_I, \quad n_I \sim \mathcal{N}(0, \sigma_I^2)\]
The power dissipated is given by Joule’s law (a non-linear operation with a squared term):
\[P = I^2 R\]
This raises two critical questions:
1. If \(I\) follows a Gaussian distribution, does \(P\) also follow a Gaussian distribution?
2. If not, what does the distribution of our computed power actually look like?
Let’s explore both questions with a simulation:
Code
import numpy as np
import matplotlib.pyplot as plt
# Set random seed for reproducibility
np.random.seed(42)
# True values
I_true = 0.5 # amperes
R = 1 # ohms (known, no noise)
P_true = I_true**2 * R # watts (true power dissipation)
# Noise level - relatively high to emphasize non-linear effects
sigma_I = 1 # amperes (200% relative noise)
# Generate many samples
N = 10000
I_measured = I_true + np.random.normal(0, sigma_I, N)
# Compute power using non-linear squared relationship
P_measured = I_measured**2 * R
# Compute errors
error_P = P_measured - P_true
# Create figure with two subplots
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Left plot: Distribution of power measurements vs Gaussian
axes[0].hist(P_measured, bins=50, density=True, alpha=0.7, color='blue',
edgecolor='black', label='Actual P distribution')
# Overlay what a Gaussian with the same mean and variance would look like
mean_P = np.mean(P_measured)
std_P = np.std(P_measured)
x_P = np.linspace(P_measured.min(), P_measured.max(), 100)
gaussian_P = (1/(std_P * np.sqrt(2*np.pi))) * np.exp(-0.5*((x_P - mean_P)/std_P)**2)
axes[0].plot(x_P, gaussian_P, 'r-', linewidth=2, label=f'Gaussian N({mean_P:.1f}, {std_P:.1f}²)')
axes[0].axvline(P_true, color='green', linestyle='--', linewidth=2, label=f'True Power = {P_true:.1f}W')
axes[0].set_xlabel('Measured Power (W)')
axes[0].set_ylabel('Probability Density')
axes[0].set_title('Power Distribution: I ~ Gaussian ≠> P ~ Gaussian\n(Non-linear: P = I²R)')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Right plot: Distribution of errors with Gaussian overlay
axes[1].hist(error_P, bins=50, density=True, alpha=0.7, color='orange', edgecolor='black', label='Actual error distribution')
# Overlay what a Gaussian with the same mean and variance would look like
mean_error = np.mean(error_P)
std_error = np.std(error_P)
x = np.linspace(error_P.min(), error_P.max(), 100)
gaussian_overlay = (1/(std_error * np.sqrt(2*np.pi))) * np.exp(-0.5*((x-mean_error)/std_error)**2)
axes[1].plot(x, gaussian_overlay, 'r-', linewidth=2, label=f'Gaussian N({mean_error:.1f}, {std_error:.1f}²)')
axes[1].set_xlabel('Error in Power (W)')
axes[1].set_ylabel('Probability Density')
axes[1].set_title('Error Distribution vs Gaussian\n(Notice: NOT quite Gaussian!)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Print statistics
print(f"True Power: {P_true} W")
print(f"Mean Measured Power: {np.mean(P_measured):.2f} W")
print(f"Std Dev of Power: {np.std(P_measured):.2f} W")
print(f"\nError Statistics:")
print(f" Mean error: {mean_error:.2f} W (should be ~0 for Gaussian)")
print(f" Std dev: {std_error:.2f} W")
print(f" Skewness: {np.mean(((error_P - mean_error)/std_error)**3):.3f} (0 for Gaussian)")
print(f" Kurtosis: {np.mean(((error_P - mean_error)/std_error)**4):.3f} (3 for Gaussian)")
True Power: 0.25 W
Mean Measured Power: 1.25 W
Std Dev of Power: 1.75 W
Error Statistics:
Mean error: 1.00 W (should be ~0 for Gaussian)
Std dev: 1.75 W
Skewness: 2.729 (0 for Gaussian)
Kurtosis: 13.906 (3 for Gaussian)
Key Observations:
The power distribution itself is NOT Gaussian (left plot): Even though the current measurements \(I\) follow a Gaussian distribution, the computed power \(P = I^2 R\) does not. The actual distribution (blue histogram) looks completely different from a Gaussian with the same mean and variance (red curve). This is a fundamental consequence: non-linear transformations of Gaussian variables are generally not Gaussian. The main driving force here is that power measurements are never below zero, whereas the Gaussian distribution does allow negative values.
The error distribution is also NOT Gaussian (right plot): Similarly, the error in power estimates shows pronounced positive skew (right tail). The distribution is visibly asymmetric, and the skewness statistic deviates significantly from zero. The kurtosis also differs notably from 3 (the value for a true Gaussian).
The mean error is positive (positive bias): Even though the input noise \(n_I\) has zero mean, the mean of the power error is positive! This is because of the squared term: \[(I_{true} + n_I)^2 R = I_{true}^2 R + 2I_{true}n_I R + n_I^2 R\] The last term \(n_I^2 R\) is always positive (since it’s squared), creating a positive bias in the power estimates. Specifically, \(E[n_I^2] = \sigma_I^2\), so the expected bias is \(R\sigma_I^2\).
The distribution is skewed right: Large positive noise values in \(I\) get amplified more than negative noise values because of the squaring.
Consequences for virtual sensors:
- We can’t assume the virtual sensor output is Gaussian just because inputs are Gaussian
- We can’t use simple linear error propagation formulas
- Confidence intervals based on Gaussian assumptions will be systematically wrong
- For non-linear transformations, Monte Carlo simulation (like above) or analytical approaches like the Delta method become necessary
- Simple averaging won’t give unbiased estimates of the true power
When is this a problem?: If the noise is small relative to the signal (high signal-to-noise ratio), the non-linearity matters less. But when noise is significant or the function is highly non-linear (e.g., squares, exponentials, divisions by small numbers), the Gaussian approximation breaks down completely.
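For this power example, the bias term and the first-order delta method can be written down directly; a quick sketch (using the same values as the simulation above) shows both, and also why the linearized variance underestimates the spread when the noise is this large:
Code
import numpy as np

I_true, R, sigma_I = 0.5, 1.0, 1.0   # same values as the simulation above

# Bias from expanding (I + n)^2 R: the n^2 R term has expectation R * sigma_I^2
bias = R * sigma_I**2
print(f"analytical bias: {bias:.2f} W")        # 1.00 W, matching the mean error above

# First-order delta method: Var(P) ~ (dP/dI)^2 * sigma_I^2 = (2 I R)^2 * sigma_I^2
std_delta = abs(2 * I_true * R) * sigma_I
print(f"delta-method std: {std_delta:.2f} W")  # 1.00 W, well below the observed 1.75 W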
4.2.3 Data Cleansing: Dealing with Messy Real-World Data
So far, we’ve assumed our sensor data follows nice theoretical models. Reality is messier. Real sensor data often suffers from:
- Incompleteness: Missing measurements due to communication failures, power outages, or sensor downtime
- Inconsistency: Contradictory readings from redundant sensors or physically impossible values
- Noise: Not just Gaussian noise, but also spikes, drift, and other artifacts
Data cleansing (also called data cleaning or preprocessing) is the process of detecting and correcting (or removing) these issues to prepare data for analysis.
4.2.3.1 The Two Faces of Outliers
An outlier is a measurement that deviates significantly from the expected pattern or trend. But here’s the critical question: what does it mean?
Sometimes outliers are the signal:
- A sudden spike in structural vibrations might indicate damage
- An unusual temperature reading might signal a fire
- Anomalous traffic patterns could indicate an accident
Sometimes outliers are noise:
- A communication error transmits garbage data
- A sensor briefly malfunctions
- Electromagnetic interference causes a spurious reading
The key is understanding your application: Are you looking for anomalies (keep the outliers!) or modeling normal behavior (remove them)?
4.2.3.2 Common Causes of Outliers
Physical perturbations: Real events that cause genuine deviations (e.g., a truck passing over a bridge causes a vibration spike)
Sensor malfunctioning: Hardware issues like degraded sensors, calibration drift, or component failure
Communication errors: Bit flips, packet loss, or transmission interference corrupting data in transit
Environmental factors: External conditions affecting sensor performance (temperature, humidity, electromagnetic fields)
4.2.3.3 Missing Values: Strategies for Handling Gaps
When data points are missing, you have several options:
| Strategy | When to Use | Pros | Cons |
|---|---|---|---|
| Ignore the gap | Time-series analysis methods that handle irregular sampling | Simple, no assumptions | Reduces effective sample size |
| Manual filling | Critical data with small gaps where you have context | Most accurate if done carefully | Not scalable, labor-intensive |
| Default/flag value | When you need complete arrays but want to mark missing data | Preserves data structure | Must remember to handle flags in analysis |
| Mean imputation | Random, occasional gaps in stationary signals | Simple, maintains mean | Reduces variance artificially |
| Linear interpolation | Smoothly varying signals with small gaps | Reasonable for smooth trends | Poor for rapidly changing signals |
Example: If you’re measuring bridge deflection (a smoothly varying quantity) and miss 2 readings in a sequence of 1000, linear interpolation is reasonable. But if you’re monitoring binary events (door open/closed), interpolation makes no physical sense.
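As a rough sketch of the last two strategies (the gap positions and values here are made up for illustration):
Code
import numpy as np

t = np.arange(10, dtype=float)
y = np.array([5.0, 5.2, np.nan, np.nan, 5.9, 6.1, 6.0, np.nan, 6.4, 6.5])
missing = np.isnan(y)

# Mean imputation: replace each gap with the mean of the observed points
y_mean = np.where(missing, np.nanmean(y), y)

# Linear interpolation: fill each gap from its observed neighbors
y_lin = y.copy()
y_lin[missing] = np.interp(t[missing], t[~missing], y[~missing])

print(y_mean)
print(y_lin)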
4.2.3.4 Statistical Approach to Outlier Detection
A principled approach to outlier detection uses probabilistic reasoning:
Model the normal behavior: Estimate a probability distribution for typical measurements (e.g., Gaussian with mean \(\mu\) and variance \(\sigma^2\))
Compute probabilities: For each measurement, compute how likely it is under the model
Flag low-probability points: Measurements with very low probability are outliers
For Gaussian models, this translates to: “flag points that are too many standard deviations away from the mean.”
4.2.3.5 Quick Review: Sample Statistics
Before we dive into a specific outlier detection method, let’s recall how we estimate population parameters from samples.
If we have measurements \(y_1, y_2, \ldots, y_N\) from our model \(y_i = q + n_i\) where \(n_i \sim \mathcal{N}(0, \sigma^2)\):
Sample mean (estimates \(q\)): \[\bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i\]
Sample variance (estimates \(\sigma^2\)): \[s^2 = \frac{1}{N-1} \sum_{i=1}^{N} (y_i - \bar{y})^2\]
Sample standard deviation (estimates \(\sigma\)): \[s = \sqrt{s^2}\]
Note: We divide by \(N-1\) (not \(N\)) in the variance formula to get an unbiased estimate (Bessel’s correction).
These statistics will be the foundation of our outlier detection method.
4.2.3.6 Chauvenet’s Criterion: A Practical Outlier Detection Method
One of the most straightforward outlier detection methods is Chauvenet’s criterion, developed by the American astronomer William Chauvenet in the 1860s for cleaning astronomical observations.
4.2.3.6.1 The Basic Idea
The core principle is simple: Reject measurements that are “too far” from the mean, where “too far” is defined in terms of standard deviations.
But how far is “too far”? Chauvenet’s insight was to use probability: if the chance of observing a point that extreme (given the Gaussian model) is less than 1/(2N), where N is the sample size, then it’s likely an outlier rather than genuine variation.
4.2.3.6.2 Step-by-Step Procedure
Step 1: Compute the standardized distance
For each measurement \(y_i\), compute how many standard deviations it is from the mean:
\[d_i = \frac{|y_i - \bar{y}|}{s}\]
where \(\bar{y}\) is the sample mean and \(s\) is the sample standard deviation.
This is sometimes called the “z-score” or “standardized residual.”
Step 2: Choose a threshold
Common threshold choices based on the number of standard deviations:
| Threshold (\(d_{threshold}\)) | P(false rejection) | Typical Use |
|---|---|---|
| 1.0 | 31.7% | Too lenient (would reject valid data often) |
| 2.0 | 4.6% | Moderate (2-sigma rule) |
| 3.0 | 0.3% | Conservative (3-sigma rule) |
| 4.0 | 0.006% | Very conservative |
The probability shown is the chance that a genuinely Gaussian point would be this far from the mean by random chance (i.e., the probability of incorrectly rejecting a good measurement).
For Chauvenet’s original criterion, the threshold adapts to sample size N. A common rule: use \(d_{threshold}\) such that \(P(|Z| > d_{threshold}) < 1/(2N)\) where \(Z \sim \mathcal{N}(0,1)\).
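This adaptive threshold is straightforward to compute from the inverse Gaussian CDF; a minimal sketch using SciPy:
Code
from scipy.stats import norm

def chauvenet_threshold(N):
    """Threshold d with P(|Z| > d) = 1/(2N) for Z ~ N(0, 1)."""
    # Two-sided tail: 2 * (1 - Phi(d)) = 1/(2N)  =>  d = Phi^{-1}(1 - 1/(4N))
    return norm.ppf(1 - 1 / (4 * N))

for N in [10, 100, 1000]:
    print(f"N = {N:4d}: reject beyond {chauvenet_threshold(N):.2f} standard deviations")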
Step 3: Reject or accept
- If \(d_i \leq d_{threshold}\): Accept the measurement
- If \(d_i > d_{threshold}\): Reject the measurement as an outlier
Step 4: Iterate
Here’s the crucial part: outliers contaminate our estimates of \(\bar{y}\) and \(s\)! If we have one extreme outlier, it pulls the mean toward it and inflates the standard deviation, making other outliers harder to detect.
Solution: Use an iterative approach:
- Compute \(\bar{y}\) and \(s\) from all data
- Apply Chauvenet’s criterion to identify outliers
- Remove identified outliers
- Recompute \(\bar{y}\) and \(s\) from remaining data
- Repeat steps 2-4 until no new outliers are found (convergence)
Typically, this converges in 2-4 iterations.
4.2.3.6.3 Example: Cleaning Temperature Data
Let’s apply Chauvenet’s criterion to a realistic example:
Note
In this example (and future ones) I am choosing to write out the simulation and experiments using Python. A good amount of the intuition about what we are trying to simulate in the example is encoded directly in the Python code (e.g., for this case, we are simulating 100 data points coming from a Normal distribution with mean 20 Celsius and standard deviation 0.5 degrees Celsius). As a result, you may need to read through the code in order to get a better understanding of the resulting plots.
Code
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# Generate realistic temperature data with outliers
np.random.seed(123)
N = 100
true_temp = 20.0 # Celsius
noise_std = 0.5
# Generate mostly good data
temperatures = true_temp + np.random.normal(0, noise_std, N)
# Add some outliers (sensor glitches)
outlier_indices = [10, 25, 63, 87]
temperatures[outlier_indices] = [35.0, -5.0, 40.0, 32.0]
def chauvenets_criterion(data, threshold=2.0, max_iterations=10):
"""
Apply Chauvenet's criterion iteratively to remove outliers.
Returns: cleaned data, indices of outliers, iteration history
"""
data = np.array(data, dtype=float)
valid_mask = np.ones(len(data), dtype=bool)
history = []
for iteration in range(max_iterations):
# Compute statistics on currently valid data
valid_data = data[valid_mask]
mean = np.mean(valid_data)
std = np.std(valid_data, ddof=1)
# Compute standardized distances for ALL points
d = np.abs(data - mean) / std
# Find outliers (only among currently valid points)
newly_rejected = valid_mask & (d > threshold)
# Store history
history.append({
'iteration': iteration + 1,
'mean': mean,
'std': std,
'n_valid': np.sum(valid_mask),
'n_rejected_this_round': np.sum(newly_rejected)
})
# Check for convergence
if np.sum(newly_rejected) == 0:
break
# Update valid mask
valid_mask = valid_mask & ~newly_rejected
outlier_indices = np.where(~valid_mask)[0]
return data[valid_mask], outlier_indices, history
# Apply Chauvenet's criterion
cleaned_data, outlier_idx, history = chauvenets_criterion(temperatures, threshold=3.0)
# Print iteration history
print("Chauvenet's Criterion - Iteration History (threshold = 3.0 σ):")
print("-" * 70)
for h in history:
print(f"Iteration {h['iteration']}: mean={h['mean']:.3f}, std={h['std']:.3f}, "
f"valid points={h['n_valid']}, rejected this round={h['n_rejected_this_round']}")
print("-" * 70)
print(f"Final result: {len(cleaned_data)} points retained, {len(outlier_idx)} outliers removed")
print(f"Detected outlier indices: {outlier_idx}")
print(f"Actual outlier indices: {outlier_indices}")
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Plot 1: Original data
axes[0].scatter(range(N), temperatures, c='blue', alpha=0.6, label='All data')
axes[0].scatter(outlier_indices, temperatures[outlier_indices],
c='red', s=100, marker='x', linewidths=3, label='Known outliers')
axes[0].axhline(true_temp, color='green', linestyle='--', linewidth=2, label='True temperature')
axes[0].set_xlabel('Measurement index')
axes[0].set_ylabel('Temperature (°C)')
axes[0].set_title('Original Data with Outliers')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Plot 2: After cleaning
valid_indices = np.arange(N)
valid_indices = np.delete(valid_indices, outlier_idx)
axes[1].scatter(valid_indices, cleaned_data, c='blue', alpha=0.6, label='Retained data')
axes[1].scatter(outlier_idx, temperatures[outlier_idx],
c='red', s=100, marker='x', linewidths=3, label='Removed as outliers')
axes[1].axhline(true_temp, color='green', linestyle='--', linewidth=2, label='True temperature')
axes[1].set_xlabel('Measurement index')
axes[1].set_ylabel('Temperature (°C)')
axes[1].set_title('After Applying Chauvenet\'s Criterion')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Compare statistics
print(f"\nStatistical comparison:")
print(f"Original data: mean={np.mean(temperatures):.3f}, std={np.std(temperatures, ddof=1):.3f}")
print(f"Cleaned data: mean={np.mean(cleaned_data):.3f}, std={np.std(cleaned_data, ddof=1):.3f}")
print(f"True values: mean={true_temp:.3f}, std={noise_std:.3f}")Chauvenet's Criterion - Iteration History (threshold = 3.0 σ):
----------------------------------------------------------------------
Iteration 1: mean=20.253, std=3.785, valid points=100, rejected this round=4
Iteration 2: mean=20.034, std=0.569, valid points=96, rejected this round=0
----------------------------------------------------------------------
Final result: 96 points retained, 4 outliers removed
Detected outlier indices: [10 25 63 87]
Actual outlier indices: [10, 25, 63, 87]
Statistical comparison:
Original data: mean=20.253, std=3.785
Cleaned data: mean=20.034, std=0.569
True values: mean=20.000, std=0.500
Key Observations:
Iteration matters: The first iteration might not catch all outliers because extreme outliers inflate the standard deviation. After removing the worst outliers, the standard deviation decreases, making remaining outliers more obvious.
Threshold selection: Using \(d_{threshold} = 3\) (3-sigma rule) is conservative and appropriate when you want to minimize false rejections. For noisier data or when outliers are more subtle, you might use a lower threshold like 2.5.
Performance: In this example, Chauvenet’s criterion successfully identified all four artificially injected outliers, and the cleaned data has statistics much closer to the true values.
Limitations:
- Assumes outliers are few (< 10-20% of data). With many outliers, the method can fail.
- Requires roughly Gaussian distribution of valid data
- Can’t distinguish between “bad outliers” (sensor errors) and “good outliers” (real extreme events)
4.2.3.6.4 When to Use Chauvenet’s Criterion
Good for:
- Cleaning sensor data before modeling
- Quality control in data collection
- Preprocessing for regression or averaging
- Small to moderate numbers of outliers (<10-20%)
Not good for:
- Anomaly detection (when outliers are the goal)
- Highly skewed or non-Gaussian data
- Data with many outliers (>20%)
- Time series with trends (detrend first!)
4.2.4 Estimating Trends in Time Series Data
Up to now, we’ve focused on understanding individual measurements and their uncertainties. But when collecting time series data from sensors, we typically want to understand patterns over time—trends, cycles, and relationships.
4.2.4.1 From Individual Points to Patterns
The simplest approach to estimating a physical quantity \(q_k\) at time \(t_i\) would be to just use the measurement at that time: \(\hat{q}_k(t_i) = y_{k,i}\). But this has serious limitations:
Problem 1: We can’t predict beyond our measurements
What if we want to estimate \(q\) at a time when we didn’t take a measurement? Or forecast future values? A single-point estimate gives us no predictive power.
Problem 2: We ignore structure
Most physical quantities don’t jump randomly—they follow patterns:
- Bridge deflection varies smoothly with load
- Temperature follows daily and seasonal cycles
- Water levels change gradually (except during storms)
If we believe the quantity has structure (smoothness, trends, periodicities), we should use that information!
Problem 3: We don’t leverage multiple measurements
If we have measurements at times \(t_1, t_2, \ldots, t_N\), and we believe the quantity varies smoothly, then nearby measurements should inform our estimate at time \(t_i\). Using only \(y_i\) wastes information from neighboring points.
4.2.4.2 The Core Idea: Model the Trend
Instead of treating each measurement independently, we’ll fit a model that describes how the physical quantity varies over time:
\[q(t) = f(t; \boldsymbol{\theta})\]
where:
- \(f\) is a function we choose (linear, polynomial, periodic, etc.)
- \(\boldsymbol{\theta}\) are parameters we estimate from data
Then we’ll find the parameters \(\boldsymbol{\theta}\) that best explain all our measurements \(y_1, y_2, \ldots, y_N\).
This approach has powerful benefits:
- Interpolation: Estimate \(q(t)\) at times between measurements
- Extrapolation: Predict \(q(t)\) at future times (with caution!)
- Noise reduction: The model “smooths out” measurement noise
- Compression: Store the model parameters instead of all raw data
- Insight: The model reveals underlying patterns
The simplest and most widely used method for fitting such models is linear regression.
4.2.4.3 Linear Regression: Finding the Best-Fit Line
Linear regression is one of the most fundamental tools in data analysis. Let’s build it from first principles, emphasizing the role of residuals—the key to understanding what regression is really doing.
4.2.4.3.1 The Setup: Simple Linear Regression
Suppose we have \(N\) measurements \((t_1, y_1), (t_2, y_2), \ldots, (t_N, y_N)\), and we believe the underlying physical quantity varies linearly with time:
\[q(t) = \beta_0 + \beta_1 t\]
where \(\beta_0\) is the intercept and \(\beta_1\) is the slope—these are the parameters we need to estimate.
Our measurements are noisy versions of this linear trend:
\[y_i = \beta_0 + \beta_1 t_i + n_i\]
Goal: Find the “best” values of \(\beta_0\) and \(\beta_1\) given our measurements.
4.2.4.3.2 Residuals: The Heart of Regression
For any choice of \(\beta_0\) and \(\beta_1\), we can compute the predicted value at time \(t_i\):
\[\hat{y}_i = \beta_0 + \beta_1 t_i\]
The residual \(r_i\) is the difference between the actual measurement and our prediction:
\[r_i = y_i - \hat{y}_i = y_i - (\beta_0 + \beta_1 t_i)\]
Key intuition:
- If our model is perfect and noise-free, all residuals would be zero
- In reality, residuals capture both measurement noise and model mismatch
- If the model structure is correct, residuals estimate the noise
- If the model structure is wrong, residuals include both noise and systematic error
4.2.4.3.3 The Least Squares Criterion
How do we choose the “best” \(\beta_0\) and \(\beta_1\)? We want to minimize the residuals, but some will be positive and some negative. A natural choice is to minimize the sum of squared residuals (also called the sum of squared errors, SSE):
\[\text{SSE}(\beta_0, \beta_1) = \sum_{i=1}^{N} r_i^2 = \sum_{i=1}^{N} (y_i - \beta_0 - \beta_1 t_i)^2\]
Why squares?
- Squares are always positive (can’t cancel out)
- Squares penalize large errors more than small ones
- Squaring leads to a smooth optimization problem with a unique solution
- Connects to maximum likelihood: Under Gaussian noise, minimizing SSE is equivalent to maximum likelihood estimation
4.2.4.3.4 Deriving the Solution
To find the minimum, we take derivatives with respect to \(\beta_0\) and \(\beta_1\) and set them equal to zero:
\[\frac{\partial \text{SSE}}{\partial \beta_0} = -2 \sum_{i=1}^{N} (y_i - \beta_0 - \beta_1 t_i) = 0\]
\[\frac{\partial \text{SSE}}{\partial \beta_1} = -2 \sum_{i=1}^{N} t_i(y_i - \beta_0 - \beta_1 t_i) = 0\]
Simplifying the first equation: \[\sum_{i=1}^{N} y_i = N\beta_0 + \beta_1 \sum_{i=1}^{N} t_i\]
\[\beta_0 = \bar{y} - \beta_1 \bar{t}\]
where \(\bar{y} = \frac{1}{N}\sum_i y_i\) and \(\bar{t} = \frac{1}{N}\sum_i t_i\).
Insight: The regression line always passes through the point \((\bar{t}, \bar{y})\)—the center of the data cloud.
Substituting this into the second equation and simplifying (algebra omitted for brevity):
\[\beta_1 = \frac{\sum_{i=1}^{N} (t_i - \bar{t})(y_i - \bar{y})}{\sum_{i=1}^{N} (t_i - \bar{t})^2} = \frac{\text{Cov}(t, y)}{\text{Var}(t)}\]
These are the least squares estimates for the slope and intercept.
4.2.4.3.5 Example: Fitting a Trend
Code
import numpy as np
import matplotlib.pyplot as plt
# Generate synthetic data with a linear trend plus noise
np.random.seed(42)
N = 30
t = np.linspace(0, 10, N)
beta_0_true = 5.0
beta_1_true = 2.0
noise_std = 3.0
y = beta_0_true + beta_1_true * t + np.random.normal(0, noise_std, N)
# Compute least squares estimates
t_bar = np.mean(t)
y_bar = np.mean(y)
beta_1 = np.sum((t - t_bar) * (y - y_bar)) / np.sum((t - t_bar)**2)
beta_0 = y_bar - beta_1 * t_bar
# Predictions and residuals
y_pred = beta_0 + beta_1 * t
residuals = y - y_pred
# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Left plot: Data and fit
axes[0].scatter(t, y, alpha=0.6, s=50, label='Measurements', color='blue')
axes[0].plot(t, y_pred, 'r-', linewidth=2, label=f'Fit: y = {beta_0:.2f} + {beta_1:.2f}t')
axes[0].plot(t, beta_0_true + beta_1_true * t, 'g--', linewidth=2, alpha=0.7,
label=f'True: y = {beta_0_true:.2f} + {beta_1_true:.2f}t')
# Draw residual lines
for i in range(N):
axes[0].plot([t[i], t[i]], [y[i], y_pred[i]], 'k-', alpha=0.3, linewidth=1)
axes[0].scatter([t_bar], [y_bar], s=200, marker='X', color='red',
edgecolors='black', linewidths=2, label='Centroid', zorder=5)
axes[0].set_xlabel('Time')
axes[0].set_ylabel('Measurement')
axes[0].set_title('Linear Regression Fit')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Right plot: Residuals
axes[1].scatter(t, residuals, alpha=0.6, s=50, color='orange')
axes[1].axhline(0, color='red', linestyle='--', linewidth=2, label='Zero residual')
axes[1].set_xlabel('Time')
axes[1].set_ylabel('Residual (y - ŷ)')
axes[1].set_title(f'Residuals (SSE = {np.sum(residuals**2):.2f})')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print(f"True parameters: β₀ = {beta_0_true:.3f}, β₁ = {beta_1_true:.3f}")
print(f"Estimated parameters: β₀ = {beta_0:.3f}, β₁ = {beta_1:.3f}")
print(f"Residual standard deviation: {np.std(residuals, ddof=2):.3f}")
print(f"(Compare to true noise std: {noise_std:.3f})")
True parameters: β₀ = 5.000, β₁ = 2.000
Estimated parameters: β₀ = 5.964, β₁ = 1.694
Residual standard deviation: 2.580
(Compare to true noise std: 3.000)
Key observations:
- The fit line passes through the centroid (marked with X)
- Residuals scatter randomly around zero (if the model is correct)
- The residual standard deviation estimates the noise level
- Vertical lines show each residual—these are what we minimize in sum-of-squares
4.2.4.3.6 Using Regression for Outlier Detection
Remember Chauvenet’s criterion? We can combine it with regression for cleaning data with trends:
Iterative regression-based outlier removal:
- Fit a regression model to all data
- Compute residuals: \(r_i = y_i - \hat{y}_i\)
- Compute mean and std of residuals: \(\bar{r}\) and \(s_r\)
- Remove points where \(|r_i - \bar{r}|/s_r > d_{threshold}\)
- Refit the model and repeat until convergence
This is more appropriate than removing outliers based on raw \(y\) values when data has a trend, because we care about deviations from the expected trend, not from a constant mean.
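A minimal sketch of this loop, reusing the simple-line fit derived above (the helper names, threshold, and iteration cap are our own choices for illustration):
Code
import numpy as np

def fit_line(t, y):
    """Least squares fit of y = b0 + b1*t; returns (b0, b1)."""
    t_bar, y_bar = np.mean(t), np.mean(y)
    b1 = np.sum((t - t_bar) * (y - y_bar)) / np.sum((t - t_bar)**2)
    return y_bar - b1 * t_bar, b1

def regression_outlier_removal(t, y, threshold=3.0, max_iterations=10):
    """Iteratively reject points whose residual exceeds threshold * std(residuals)."""
    keep = np.ones(len(y), dtype=bool)
    for _ in range(max_iterations):
        b0, b1 = fit_line(t[keep], y[keep])
        r = y - (b0 + b1 * t)                 # residuals for ALL points
        s_r = np.std(r[keep], ddof=2)         # ddof=2: two fitted parameters
        rejected = keep & (np.abs(r - np.mean(r[keep])) / s_r > threshold)
        if not rejected.any():                # converged: no new outliers
            break
        keep &= ~rejected                     # drop them and refit
    return keep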
4.2.4.3.7 Extension to Non-Linear Functions: The Power of Linearity in Parameters
Here’s a remarkable fact: linear regression can fit non-linear functions, as long as the function is linear in the parameters.
For example, suppose we believe our quantity follows a quadratic trend:
\[q(t) = \beta_0 + \beta_1 t + \beta_2 t^2\]
This looks non-linear (there’s a \(t^2\) term!), but it’s linear in the parameters \(\beta_0, \beta_1, \beta_2\). We can still use least squares!
Matrix Form: Define “basis functions” \(\phi_1(t), \phi_2(t), \ldots, \phi_M(t)\) and write:
\[q(t) = \sum_{j=1}^{M} \beta_j \phi_j(t) = \boldsymbol{\phi}(t)^T \boldsymbol{\beta}\]
For our quadratic example: \(\phi_1(t) = 1\), \(\phi_2(t) = t\), \(\phi_3(t) = t^2\).
With measurements \(y_1, \ldots, y_N\) at times \(t_1, \ldots, t_N\), construct the design matrix:
\[\mathbf{X} = \begin{bmatrix} \phi_1(t_1) & \phi_2(t_1) & \cdots & \phi_M(t_1) \\ \phi_1(t_2) & \phi_2(t_2) & \cdots & \phi_M(t_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_1(t_N) & \phi_2(t_N) & \cdots & \phi_M(t_N) \end{bmatrix}\]
And the measurement vector: \(\mathbf{y} = [y_1, y_2, \ldots, y_N]^T\)
The least squares solution is:
\[\boldsymbol{\hat{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}\]
This single formula solves linear regression for:
- Polynomials: \(\phi_j(t) = t^{j-1}\)
- Fourier series: \(\phi_j(t) = \sin(j\omega t), \cos(j\omega t)\)
- Exponentials: \(\phi_j(t) = e^{\lambda_j t}\)
- Any combination you can imagine!
The key restriction: The model must be linear in the parameters \(\beta_j\), even if it’s non-linear in \(t\).
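For instance, a Fourier basis for data with a daily cycle leads to a design matrix built exactly like the polynomial one (the period and number of harmonics here are illustrative):
Code
import numpy as np

omega = 2 * np.pi / 24.0                 # fundamental frequency (period = 24 h)
t = np.linspace(0, 48, 200)              # two days of timestamps

X = np.column_stack([
    np.ones_like(t),                              # phi_1(t) = 1
    np.sin(omega * t), np.cos(omega * t),         # first harmonic
    np.sin(2 * omega * t), np.cos(2 * omega * t)  # second harmonic
])
# With measurements y, the fit is the same normal-equations solve as before:
# beta = np.linalg.solve(X.T @ X, X.T @ y)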
4.2.4.3.8 Example: Fitting a Polynomial
Code
# Generate data with a non-linear trend
np.random.seed(123)
N = 30
t = np.linspace(0, 5, N)
y_true = 2 + 3*t - 0.5*t**2 # Quadratic trend
y = y_true + np.random.normal(0, 2, N)
# Fit polynomials of degree 1, 2, and 3
degrees = [1, 2, 3]
colors = ['red', 'green', 'blue']
plt.figure(figsize=(10, 6))
plt.scatter(t, y, alpha=0.6, s=50, label='Measurements', color='black', zorder=5)
t_smooth = np.linspace(0, 5, 200)
for degree, color in zip(degrees, colors):
# Construct design matrix
X = np.column_stack([t**j for j in range(degree + 1)])
X_smooth = np.column_stack([t_smooth**j for j in range(degree + 1)])
# Least squares solution
beta = np.linalg.solve(X.T @ X, X.T @ y)
# Predictions
y_pred = X @ beta
y_smooth = X_smooth @ beta
# Plot
plt.plot(t_smooth, y_smooth, color=color, linewidth=2,
label=f'Degree {degree} (SSE={np.sum((y - y_pred)**2):.1f})')
plt.xlabel('Time')
plt.ylabel('Measurement')
plt.title('Polynomial Regression: Different Model Complexities')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Important lessons:
- Degree 1 (linear) underfits—it can’t capture the curvature
- Degree 2 (quadratic) fits well—it matches the true model
- Degree 3 (cubic) fits slightly better (lower SSE) but starts to wiggle unnecessarily
This illustrates the bias-variance tradeoff:
- Too simple models have high bias (systematic error)
- Too complex models have high variance (overfitting to noise)
- The “just right” model matches the true complexity
4.2.4.3.9 Summary: The Power of Linear Regression
Linear regression is remarkably versatile:
- Simple and interpretable: Clear parameters with physical meaning
- Computationally efficient: Closed-form solution, no iterative optimization
- Extensible: Works for polynomials, Fourier series, and many non-linear basis functions
- Foundation for advanced methods: Underlies many machine learning algorithms
Key takeaways:
- Residuals are the difference between data and model predictions
- Good models have residuals that look like random noise
- Least squares minimizes the sum of squared residuals
- Linear regression works for non-linear functions (as long as they’re linear in parameters)
- Can be combined with outlier detection for robust fitting