8  Module 7: Database Design

8.1 Module Overview

8.1.1 Learning Objectives

By the end of this module, students will be able to:

  • Explain how relational databases are designed and connect the theory behind it to earlier concepts such as Entity Relationship Diagrams and Relational Algebra
  • Explain how the properties of the data lead to certain dependencies that allow for it to be structured, and understand how different structures lead to different design choices named “normal forms”
  • Decompose a relation up to its Boyce-Codd Normal Form (BCNF), when possible
  • Derive functional dependencies given properties of the data, and draw these dependencies in a diagram
  • Understand the concept of closures and keys of relations

8.1.2 Topics Covered

  • Principles of database design
  • Properties and normal forms
  • Functional dependencies
  • Keys and closures

8.1.3 Project Milestones

Meet with your team to agree on a set of milestones and deliverables that are needed to complete the final project. This will be required as part of Assignment 4.

8.2 Lecture Notes

8.2.1 Why do we need to design a database?

In Modules 5 and 6, we learned the relational model and SQL—giving us the tools to create databases and query them. We even saw mechanical rules for translating ER diagrams into relational schemas. So we’re done, right? We can design databases!

Not quite. While you can create a database from an ER diagram, the question remains: Is your design any good?

Different designs can store the same information, but some designs lead to:

  • Wasted storage from redundant data
  • Inconsistencies when data is updated incorrectly
  • Lost information when rows are deleted
  • Maintenance nightmares as the database evolves

Database design theory helps us create schemas that avoid these problems. The good news is there’s elegant mathematical theory—based on functional dependencies and normal forms—that guides us toward better designs, often automatically.

8.2.1.1 A Poorly Designed Database

Let’s see what can go wrong. Suppose we want to track our bridge monitoring system, and we decide to store everything in one big table:

BridgeSensor (poorly designed):

Bridge_ID Bridge_Name Year_Built Length Sensor_ID Sensor_Type Install_Date Manufacturer Mfr_Country
B001 Roberto Clemente Bridge 1928 244 S001 accelerometer 2020-01-15 Acme Sensors USA
B001 Roberto Clemente Bridge 1928 244 S002 strain_gauge 2020-01-15 Acme Sensors USA
B001 Roberto Clemente Bridge 1928 244 S003 temperature 2020-02-10 TempTech Germany
B002 Fort Pitt Bridge 1959 1201 S004 accelerometer 2019-12-01 Acme Sensors USA
B002 Fort Pitt Bridge 1959 1201 S005 strain_gauge 2019-12-01 SensorCorp Japan

This design works—we can store data and query it. But look closely at the problems:

Problem 1: Redundancy

Notice how “Roberto Clemente Bridge”, “1928”, and “244” appear three times—once for each sensor on bridge B001. Similarly, “Acme Sensors” and “USA” appear three times. We’re storing the same information repeatedly, wasting space.

Problem 2: Update Anomalies

Suppose we discover the Roberto Clemente Bridge is actually 245 meters long, not 244. We must update three rows. If we forget to update one row, our database becomes inconsistent—some rows say 244, others say 245. Which is correct?

Problem 3: Deletion Anomalies

What if we remove all sensors from the Fort Pitt Bridge (maybe for maintenance)? We’d delete both rows with Bridge_ID = B002. But now we’ve lost all information about that bridge—we no longer know it exists, when it was built, or its length. The sensor data deletion accidentally deleted bridge data too.

Problem 4: Insertion Anomalies

Suppose we build a new bridge but haven’t installed sensors yet. We can’t add a row to BridgeSensor because we don’t have sensor information to fill in. We’re forced to either wait until sensors are installed, or insert NULL values for sensor columns, or create dummy sensor entries—all unsatisfying workarounds.

8.2.1.2 A Better Design: Decomposition

Now let’s decompose this into separate tables following proper design principles:

Bridge:

Bridge_ID Bridge_Name Year_Built Length
B001 Roberto Clemente Bridge 1928 244
B002 Fort Pitt Bridge 1959 1201
B003 Fort Duquesne Bridge 1969 1201

Sensor:

Sensor_ID Sensor_Type Install_Date Bridge_ID Manufacturer_ID
S001 accelerometer 2020-01-15 B001 M001
S002 strain_gauge 2020-01-15 B001 M001
S003 temperature 2020-02-10 B001 M002
S004 accelerometer 2019-12-01 B002 M001
S005 strain_gauge 2019-12-01 B002 M003

Manufacturer:

Manufacturer_ID Manufacturer Mfr_Country
M001 Acme Sensors USA
M002 TempTech Germany
M003 SensorCorp Japan

How decomposition solves the problems:

  1. No redundancy: Each bridge’s information appears exactly once. Each manufacturer’s information appears exactly once.

  2. No update anomalies: To change a bridge’s length, we update exactly one row in the Bridge table. Impossible to create inconsistency.

  3. No deletion anomalies: Removing all sensors from a bridge (deleting rows from Sensor table) doesn’t affect the Bridge table. Bridge information persists.

  4. No insertion anomalies: We can add a new bridge to the Bridge table without having any sensors. We can add a new manufacturer without having sensors from them yet.

8.2.1.3 Design Theory to the Rescue

The decomposition above looks obviously better. But how do we systematically arrive at such designs? How do we know we’re done decomposing? Could we decompose further?

This is where database design theory comes in:

  • Functional dependencies capture properties of the data (e.g., “Bridge_ID determines Bridge_Name”)
  • Normal forms (1NF, 2NF, 3NF, BCNF) provide target structures that eliminate anomalies
  • Decomposition algorithms automatically break down mega-relations into well-designed tables

The rest of this module develops this theory systematically. We’ll learn how to: 1. Identify functional dependencies from data properties 2. Recognize when a schema violates normal forms 3. Decompose relations to achieve higher normal forms (usually BCNF) 4. Verify that our decomposition preserves all information

By the end, you’ll have principled techniques for designing databases that are efficient, consistent, and maintainable.

8.2.2 Database design theory

How do we systematically design good databases? Database design theory provides a principled approach based on decomposition.

8.2.2.1 The Decomposition Approach

The core idea is simple but powerful:

  1. Start with a “mega-relation” containing all attributes in a single table
  2. Identify properties of the data (functional dependencies)
  3. Decompose systematically into smaller relations based on these properties
  4. Achieve a normal form that guarantees no anomalies and no information loss

This process can be done algorithmically—given a set of functional dependencies, we can automatically decompose a relation into a well-designed schema.

8.2.2.2 The Decomposition Process

Here’s the general workflow:

INPUT:

  • A relation \(R\) with all attributes (the “mega-relation”)
  • Functional dependencies describing properties of the data (e.g., “Bridge_ID determines Bridge_Name”)
  • Optionally: multivalued dependencies for more complex cases

PROCEDURE:

  • Analyze functional dependencies to identify problems (violations of normal forms)
  • Systematically decompose \(R\) into smaller relations \(R_1, R_2, \ldots, R_n\)
  • Each decomposition step removes one source of anomalies

OUTPUT:

  • A set of relations in a target normal form (typically BCNF or 3NF)

  • Guarantees:

    • No anomalies: No redundancy, no update/deletion/insertion anomalies
    • Lossless decomposition: We can reconstruct the original information by joining the decomposed relations
    • Dependency preservation (when possible): All original constraints are still enforceable

8.2.2.3 Normal Forms: Quality Levels for Designs

Normal forms are increasingly strict criteria for database design. Think of them as quality levels:

  • First Normal Form (1NF): Eliminate repeating groups and ensure atomic values
  • Second Normal Form (2NF): Eliminate partial dependencies on composite keys
  • Third Normal Form (3NF): Eliminate transitive dependencies
  • Boyce-Codd Normal Form (BCNF): Every determinant must be a candidate key

Each normal form eliminates certain types of anomalies. Higher normal forms are “better” but sometimes require tradeoffs. BCNF is the sweet spot for most applications—it eliminates all anomalies arising from functional dependencies.

8.2.2.4 Why This Matters

The beautiful thing about this theory is that it’s automatic and systematic:

  • You don’t need to guess what tables to create
  • You don’t rely on intuition about what “feels right”
  • The math tells you exactly how to decompose

This doesn’t mean design is mechanical—you still need to:

  • Understand your domain well enough to identify functional dependencies
  • Sometimes make tradeoffs between normal forms and query performance
  • Recognize when denormalization makes sense for specific use cases

But the theory gives you a solid foundation and a principled starting point.

8.2.2.5 Roadmap for This Module

To understand and apply this theory, we’ll proceed as follows:

  1. Functional dependencies: Learn what they are, how to identify them, and rules for reasoning about them
  2. Keys and closures: Understand how functional dependencies relate to keys
  3. Normal forms: Study 1NF, 2NF, 3NF, and BCNF in detail with examples
  4. Decomposition algorithms: See how to systematically achieve normal forms

Let’s start with functional dependencies—the foundation of everything else.

8.2.2.6 Functional Dependencies

A functional dependency (FD) is a constraint that describes a relationship between attributes. It captures the idea that knowing the value of certain attributes allows us to uniquely determine the value of other attributes.

Formal definition: Given a relation \(R\), a functional dependency \(\bar{A} \rightarrow \bar{B}\) holds if:

Whenever two tuples in \(R\) agree on all attributes in \(\bar{A}\), they must also agree on all attributes in \(\bar{B}\).

In other words, the values of attributes in \(\bar{A}\) determine or functionally determine the values of attributes in \(\bar{B}\).

Notation:

  • \(\bar{A}\) denotes a set of attributes \(\{A_1, A_2, \ldots, A_n\}\)
  • \(\bar{A} \rightarrow \bar{B}\) means “\(\bar{A}\) functionally determines \(\bar{B}\)
  • We say “\(\bar{A}\) is a determinant” and “\(\bar{B}\) depends on \(\bar{A}\)

8.2.2.7 Examples from the Bridge Monitoring System

Let’s identify functional dependencies in our BridgeSensor mega-relation:

BridgeSensor(Bridge_ID, Bridge_Name, Year_Built, Length, Sensor_ID, Sensor_Type, Install_Date, Manufacturer, Mfr_Country)

FD 1: \(\text{Bridge\_ID} \rightarrow \text{Bridge\_Name, Year\_Built, Length}\)

If two rows have the same Bridge_ID, they must have the same Bridge_Name, Year_Built, and Length.

This makes sense: a bridge ID uniquely identifies a bridge, so all properties of that bridge are determined by the ID.

FD 2: \(\text{Sensor\_ID} \rightarrow \text{Sensor\_Type, Install\_Date, Bridge\_ID, Manufacturer}\)

If two rows have the same Sensor_ID, they must have the same Sensor_Type, Install_Date, Bridge_ID, and Manufacturer.

A sensor ID uniquely identifies a sensor and all its properties, including which bridge it’s on and who manufactured it.

FD 3: \(\text{Manufacturer} \rightarrow \text{Mfr\_Country}\)

If two rows have the same Manufacturer, they must have the same Mfr_Country.

A manufacturer’s name determines where they’re located. (We’re assuming manufacturer names are unique—if not, we’d need a Manufacturer_ID.)

Non-example: We do NOT have \(\text{Year\_Built} \rightarrow \text{Bridge\_Name}\)

Multiple bridges can be built in the same year, so knowing the year doesn’t tell us which bridge. This is not a functional dependency.

8.2.2.8 Visualizing Functional Dependencies

We can draw functional dependency diagrams to show the structure. For BridgeSensor:

Bridge_ID → Bridge_Name, Year_Built, Length
Sensor_ID → Sensor_Type, Install_Date, Bridge_ID, Manufacturer
Manufacturer → Mfr_Country

These dependencies reveal the hidden structure in our data. Notice that Sensor_ID determines Bridge_ID, and Bridge_ID in turn determines bridge properties. By transitivity, Sensor_ID determines all attributes in the relation—making it a candidate key!

8.2.2.9 Types of Functional Dependencies

Given a functional dependency \(\bar{A} \rightarrow \bar{B}\):

Trivial FD: \(\bar{B} \subseteq \bar{A}\)

Every attribute on the right side is also on the left side.

Example: \(\{\text{Bridge\_ID, Bridge\_Name}\} \rightarrow \text{Bridge\_ID}\)

This is always true (trivially) because if two tuples agree on Bridge_ID and Bridge_Name, they obviously agree on Bridge_ID alone. Trivial FDs are not interesting—they don’t tell us anything useful.

Nontrivial FD: \(\bar{B} \not\subseteq \bar{A}\)

At least one attribute on the right side is not on the left side.

Example: \(\text{Bridge\_ID} \rightarrow \{\text{Bridge\_Name, Bridge\_ID}\}\)

This is nontrivial because Bridge_Name is not in the left side.

Completely nontrivial FD: \(\bar{A} \cap \bar{B} = \emptyset\)

No attribute appears on both sides.

Example: \(\text{Bridge\_ID} \rightarrow \text{Bridge\_Name, Year\_Built}\)

This is completely nontrivial—the sets are disjoint.

Why this matters: Completely nontrivial FDs are the most interesting for database design. They reveal true dependencies between different attributes.

8.2.2.10 Rules for Manipulating Functional Dependencies

Functional dependencies obey certain algebraic rules. These rules let us derive new FDs from existing ones.

Splitting Rule (Right Side):

If \(\bar{A} \rightarrow \{B_1, B_2, \ldots, B_m\}\), then we can split this into multiple FDs: \[\bar{A} \rightarrow B_1, \quad \bar{A} \rightarrow B_2, \quad \ldots, \quad \bar{A} \rightarrow B_m\]

Example: From \(\text{Bridge\_ID} \rightarrow \{\text{Bridge\_Name, Year\_Built, Length}\}\), we get:

  • \(\text{Bridge\_ID} \rightarrow \text{Bridge\_Name}\)
  • \(\text{Bridge\_ID} \rightarrow \text{Year\_Built}\)
  • \(\text{Bridge\_ID} \rightarrow \text{Length}\)

Combining Rule (Right Side):

Conversely, if we have multiple FDs with the same left side, we can combine them:

If \(\bar{A} \rightarrow B_1\) and \(\bar{A} \rightarrow B_2\), … and \(\bar{A} \rightarrow B_m\), then: \[\bar{A} \rightarrow \{B_1, B_2, \ldots, B_m\}\]

Important: We cannot split or combine the left side arbitrarily. The FD \(\{A_1, A_2\} \rightarrow B\) is NOT equivalent to \(A_1 \rightarrow B\) and \(A_2 \rightarrow B\).

8.2.2.11 Functional Dependencies and Keys

Keys are intimately connected to functional dependencies. In fact, a key is a set of attributes whose functional dependency extends to all attributes in the relation.

Definition: A set of attributes \(\bar{K}\) is a key for relation \(R\) if:

  1. \(\bar{K} \rightarrow \text{all attributes of } R\) (the key determines everything)
  2. No proper subset of \(\bar{K}\) has property 1 (minimality—can’t remove any attribute from \(\bar{K}\) and still have a key)

The first property means that knowing the values of key attributes uniquely identifies a tuple—no two tuples can have the same key values. The second property means the key contains no redundant attributes.

Example: In our BridgeSensor relation:

\(\{\text{Sensor\_ID}\} \rightarrow \text{all attributes}\)

This is a key because:

  • Sensor_ID determines all nine attributes in the relation (directly and via transitivity through Bridge_ID and Manufacturer)
  • It’s a single attribute, so it’s automatically minimal

Non-example: \(\{\text{Bridge\_ID, Sensor\_ID}\}\) also determines all attributes, but it’s not a key because it’s not minimal—we can remove Bridge_ID and still have a key (just Sensor_ID).

Multiple Candidate Keys:

A relation can have multiple keys. Each is called a candidate key. We typically choose one as the primary key for implementation.

Example: Consider a simpler relation:

Student(Student_ID, Email, Name, GPA)

With functional dependencies:

  • \(\text{Student\_ID} \rightarrow \text{Email, Name, GPA}\)
  • \(\text{Email} \rightarrow \text{Student\_ID, Name, GPA}\)

Both Student_ID and Email are candidate keys—either one uniquely identifies a student. We’d choose one (say Student_ID) as the primary key.

8.2.2.12 Inference Rules for Functional Dependencies

Beyond splitting and combining, there are several other important rules for deriving new functional dependencies from existing ones. These rules are called Armstrong’s Axioms and form a complete system for reasoning about FDs.

Reflexivity (Trivial FD Rule):

If \(\bar{B} \subseteq \bar{A}\), then \(\bar{A} \rightarrow \bar{B}\)

Any set of attributes determines its subsets.

Example: \(\{\text{Bridge\_ID, Bridge\_Name}\} \rightarrow \text{Bridge\_ID}\) (trivially true)

This captures trivial FDs—they’re always true but not informative.

Augmentation:

If \(\bar{A} \rightarrow \bar{B}\), then \(\{\bar{A}, \bar{C}\} \rightarrow \{\bar{B}, \bar{C}\}\) for any set \(\bar{C}\)

If \(\bar{A}\) determines \(\bar{B}\), then adding attributes to both sides preserves the dependency.

Example: From \(\text{Bridge\_ID} \rightarrow \text{Bridge\_Name}\), we get: \[\{\text{Bridge\_ID, Sensor\_ID}\} \rightarrow \{\text{Bridge\_Name, Sensor\_ID}\}\]

Transitivity:

If \(\bar{A} \rightarrow \bar{B}\) and \(\bar{B} \rightarrow \bar{C}\), then \(\bar{A} \rightarrow \bar{C}\)

Dependencies chain together.

Example: Suppose we have:

  • \(\text{Sensor\_ID} \rightarrow \text{Bridge\_ID}\)
  • \(\text{Bridge\_ID} \rightarrow \text{Year\_Built}\)

Then by transitivity: \(\text{Sensor\_ID} \rightarrow \text{Year\_Built}\)

This makes intuitive sense: if a sensor ID determines which bridge, and the bridge determines when it was built, then the sensor ID determines when the bridge was built.

Derived Rules:

From Armstrong’s Axioms, we can derive additional useful rules:

Union: If \(\bar{A} \rightarrow \bar{B}\) and \(\bar{A} \rightarrow \bar{C}\), then \(\bar{A} \rightarrow \{\bar{B}, \bar{C}\}\)

Decomposition: If \(\bar{A} \rightarrow \{\bar{B}, \bar{C}\}\), then \(\bar{A} \rightarrow \bar{B}\) and \(\bar{A} \rightarrow \bar{C}\)

Pseudotransitivity: If \(\bar{A} \rightarrow \bar{B}\) and \(\{\bar{B}, \bar{C}\} \rightarrow \bar{D}\), then \(\{\bar{A}, \bar{C}\} \rightarrow \bar{D}\)

These rules are all you need to derive any FD that logically follows from a given set of FDs. This completeness property is mathematically elegant—Armstrong’s Axioms are both sound (only derive true FDs) and complete (can derive all true FDs).

8.2.2.13 Closures

The closure of a set of attributes is a powerful tool for answering questions about functional dependencies. It tells us all attributes that can be determined from a given set.

Definition: Given a set of functional dependencies \(F\) and a set of attributes \(\bar{A}\), the closure of \(\bar{A}\) (denoted \(\bar{A}^+\)) is:

The set of all attributes \(B\) such that \(\bar{A} \rightarrow B\) can be inferred from \(F\) using Armstrong’s Axioms.

In other words, \(\bar{A}^+\) contains all attributes functionally determined by \(\bar{A}\).

Computing the Closure:

Here’s the algorithm for computing \(\bar{A}^+\) given a set of FDs:

  1. Start with \(\bar{A}^+ = \bar{A}\) (the closure includes the starting attributes)
  2. Repeat until no more attributes can be added:
    • For each FD \(\bar{X} \rightarrow \bar{Y}\) in \(F\):
      • If \(\bar{X} \subseteq \bar{A}^+\) (all attributes in \(\bar{X}\) are already in the closure)
      • Then add \(\bar{Y}\) to \(\bar{A}^+\) (we can now determine \(\bar{Y}\) too)
  3. Return \(\bar{A}^+\)

Example: Consider the BridgeSensor relation with these FDs:

  • \(F_1\): \(\text{Bridge\_ID} \rightarrow \text{Bridge\_Name, Year\_Built, Length}\)
  • \(F_2\): \(\text{Sensor\_ID} \rightarrow \text{Sensor\_Type, Install\_Date, Bridge\_ID, Manufacturer}\)
  • \(F_3\): \(\text{Manufacturer} \rightarrow \text{Mfr\_Country}\)

Compute: \(\{\text{Sensor\_ID}\}^+\)

Step 1: Start with \(\{\text{Sensor\_ID}\}\)

Step 2: Apply \(F_2\) since Sensor_ID is in our closure:

  • Add Sensor_Type, Install_Date, Bridge_ID, Manufacturer
  • Now: \(\{\text{Sensor\_ID, Sensor\_Type, Install\_Date, Bridge\_ID, Manufacturer}\}\)

Step 3: Apply \(F_1\) since Bridge_ID is now in our closure:

  • Add Bridge_Name, Year_Built, Length
  • Now: \(\{\text{Sensor\_ID, Sensor\_Type, Install\_Date, Bridge\_ID, Manufacturer, Bridge\_Name, Year\_Built, Length}\}\)

Step 4: Apply \(F_3\) since Manufacturer is another attribute with an associated FD that we added to our closure in Step 2:

  • Add Mfr_Country
  • Now: \(\{\text{Sensor\_ID, Sensor\_Type, Install\_Date, Bridge\_ID, Manufacturer, Bridge\_Name, Year\_Built, Length, Mfr\_Country}\}\)

Step 5: No more FDs apply. Done!

Result: \(\{\text{Sensor\_ID}\}^+ = \{\text{Sensor\_ID, Sensor\_Type, Install\_Date, Bridge\_ID, Manufacturer, Bridge\_Name, Year\_Built, Length, Mfr\_Country}\}\)

This tells us: knowing the Sensor_ID, we can determine ALL attributes in the relation—not just sensor properties, but also which bridge it’s on, all that bridge’s properties, the manufacturer, and the manufacturer’s country (all via transitivity). This confirms that Sensor_ID is a superkey (and in fact, a key).

Another Example: \(\{\text{Bridge\_ID}\}^+\)

Step 1: Start with \(\{\text{Bridge\_ID}\}\)

Step 2: Apply \(F_1\):

  • Add Bridge_Name, Year_Built, Length
  • Now: \(\{\text{Bridge\_ID, Bridge\_Name, Year\_Built, Length}\}\)

Step 3: Can’t apply \(F_2\) (don’t have Sensor_ID) or \(F_3\) (don’t have Manufacturer)

Result: \(\{\text{Bridge\_ID}\}^+ = \{\text{Bridge\_ID, Bridge\_Name, Year\_Built, Length}\}\)

Bridge_ID only determines bridge properties, not sensor or manufacturer properties.

8.2.2.14 Closures and Keys

Closures provide a simple test for whether a set of attributes is a key.

Key Test Using Closure:

A set of attributes \(\bar{A}\) is a superkey for relation \(R\) if and only if \(\bar{A}^+\) contains all attributes of \(R\).

A superkey is a key if no proper subset of \(\bar{A}\) is also a superkey (minimality condition).

Example: Is \(\{\text{Sensor\_ID}\}\) a key for BridgeSensor?

From our earlier calculation: \(\{\text{Sensor\_ID}\}^+ = \{\text{Sensor\_ID, Sensor\_Type, Install\_Date, Bridge\_ID, Manufacturer, Bridge\_Name, Year\_Built, Length, Mfr\_Country}\}\)$

This is all 9 attributes! Therefore, \(\{\text{Sensor\_ID}\}\) is a superkey.

Is it a key? Since it’s a single attribute, it’s automatically minimal—we can’t remove any attributes from it. Therefore, \(\{\text{Sensor\_ID}\}\) is a key for BridgeSensor.

Example: Is \(\{\text{Bridge\_ID, Sensor\_ID}\}\) a key?

Let’s compute the closure:

Start: \(\{\text{Bridge\_ID, Sensor\_ID}\}\)

Apply \(F_1\) (Bridge_ID → Bridge_Name, Year_Built, Length):

  • Add: Bridge_Name, Year_Built, Length

Apply \(F_2\) (Sensor_ID → Sensor_Type, Install_Date, Bridge_ID, Manufacturer):

  • Add: Sensor_Type, Install_Date, Manufacturer (Bridge_ID already there)

Apply \(F_3\) (Manufacturer → Mfr_Country):

  • Add: Mfr_Country

Result: All 9 attributes! So \(\{\text{Bridge\_ID, Sensor\_ID}\}\) is a superkey.

But is it a key? Check if any proper subset is also a superkey:

\(\{\text{Bridge\_ID}\}^+ = \{\text{Bridge\_ID, Bridge\_Name, Year\_Built, Length}\}\) — not all attributes

\(\{\text{Sensor\_ID}\}^+\) results in all attributes as we have seen before.

Since \(\{\text{Sensor\_ID}\}\) is a superkey and a proper subset of \(\{\text{Bridge\_ID, Sensor\_ID}\}\), the composite is not a key (it violates minimality). The key for BridgeSensor is just Sensor_ID—each row represents one sensor, and sensor IDs are unique.

Finding All Keys Systematically:

How do we find all candidate keys for a relation?

Naive approach: Test every subset of attributes. For \(n\) attributes, that’s \(2^n\) subsets—exponential and impractical for large relations.

Better approach: Use the minimality property

  1. Start by testing single attributes: If any attribute \(A\) has \(\{A\}^+ = \text{all attributes}\), then \(\{A\}\) is a key
  2. Test pairs of attributes: \(\{A, B\}\)
  3. Test triples: \(\{A, B, C\}\)
  4. And so on…

Key optimization: If \(\bar{C}\) is a key, then any superset \(\{\bar{C}, D\}\) is a superkey but not a key (violates minimality). So once we find a key of size \(k\), we don’t need to test any supersets of that key.

Example: For BridgeSensor with 9 attributes, we’d test:

  • 9 single-attribute sets
  • If none are keys, test \(\binom{9}{2} = 36\) pairs
  • And so on…

In practice, domain knowledge often tells us what the keys are. But closure computation gives us a mechanical procedure when we’re unsure.

8.2.2.15 Specifying Functional Dependencies for a Relation

When designing a database, we need to specify which functional dependencies hold. But there are infinitely many true FDs (including all the trivial ones). How do we choose which ones to write down?

Logical Implication:

We say a set of FDs \(S_2\) follows from (or is implied by) a set \(S_1\) if:

Every relation instance that satisfies all FDs in \(S_1\) also satisfies all FDs in \(S_2\).

In other words, \(S_1\) logically entails \(S_2\).

Example:

Given: \(S_1 = \{\text{Sensor\_ID} \rightarrow \text{Bridge\_ID}, \text{Bridge\_ID} \rightarrow \text{Year\_Built}\}\)

Does \(S_1\) imply \(\text{Sensor\_ID} \rightarrow \text{Year\_Built}\)?

Yes! By transitivity. Any relation satisfying \(S_1\) must also satisfy \(\text{Sensor\_ID} \rightarrow \text{Year\_Built}\).

Testing If an FD Follows:

To test whether \(\bar{A} \rightarrow \bar{B}\) follows from a set \(S\) of FDs:

  1. Compute \(\bar{A}^+\) using the FDs in \(S\)
  2. If \(\bar{B} \subseteq \bar{A}^+\), then \(\bar{A} \rightarrow \bar{B}\) follows from \(S\)
  3. Otherwise, it doesn’t

Example:

Given FDs:

  • \(\text{Bridge\_ID} \rightarrow \text{Bridge\_Name, Year\_Built, Length}\)
  • \(\text{Sensor\_ID} \rightarrow \text{Bridge\_ID, Sensor\_Type}\)

Does \(\text{Sensor\_ID} \rightarrow \text{Year\_Built}\) follow?

Compute \(\{\text{Sensor\_ID}\}^+\):

  • Start: Sensor_ID
  • Apply second FD: Add Bridge_ID, Sensor_Type
  • Apply first FD: Add Bridge_Name, Year_Built, Length
  • Result: {Sensor_ID, Bridge_ID, Sensor_Type, Bridge_Name, Year_Built, Length}

Since Year_Built is in the closure, yes, \(\text{Sensor\_ID} \rightarrow \text{Year\_Built}\) follows.

Minimal Basis (Canonical Cover):

Our goal when specifying FDs is to find a minimal set that captures all the dependencies without redundancy.

A set of FDs \(S\) is minimal (or a canonical cover) if:

  1. All FDs are completely nontrivial (right side doesn’t overlap with left side)
  2. No FD in \(S\) can be derived from the others (no redundancy)
  3. No attribute in any left side is redundant (can’t make any left side smaller)

Example of redundant FDs:

Consider:

  • \(\text{A} \rightarrow \text{B}\)
  • \(\text{B} \rightarrow \text{C}\)
  • \(\text{A} \rightarrow \text{C}\)

The third FD is redundant—it follows from the first two by transitivity. A minimal set would omit it.

Why minimality matters:

  • Clarity: Easier to understand the essential dependencies
  • Efficiency: Fewer constraints to check when inserting/updating data
  • Design: Minimal FD sets clearly reveal the structure needed for normalization

In practice, you identify FDs based on domain knowledge (understanding what real-world properties constrain the data), then optionally simplify to a minimal basis. The minimal basis is what you’ll use when decomposing relations.

8.2.3 Normal Forms

Now that we understand functional dependencies, we can study normal forms—quality criteria for database design. Each normal form eliminates certain types of anomalies by restricting what functional dependencies are allowed.

Normal forms form a hierarchy:

\[\text{1NF} \subset \text{2NF} \subset \text{3NF} \subset \text{BCNF}\]

Every relation in 2NF is also in 1NF. Every relation in 3NF is also in 2NF. And so on. Higher normal forms are “stricter” and eliminate more anomalies.

8.2.3.1 Why Normalization Matters

Unnormalized databases suffer from several problems:

  1. Anomalies: Redundancy leads to update, deletion, and insertion anomalies (as we saw earlier)
  2. Inefficient updates: Programs must update multiple copies of the same data
  3. Complex indexing: Indexing large, redundant tables is cumbersome
  4. Wasted storage: Repeated data consumes unnecessary space
  5. Maintenance nightmares: Changes to schema or constraints require updating many places

Normalization systematically eliminates these problems by decomposing relations according to their functional dependencies.

Important notes:

  • Normalization should be part of the design process, not an afterthought
  • The normalization process may reveal additional entities and relationships to add to your ER diagram
  • ER modeling and normalization should be used together—ER diagrams help identify entities and relationships, normalization ensures the resulting schema is well-designed

8.2.3.2 The Running Example

We’ll use our BridgeSensor mega-relation from earlier:

BridgeSensor(Bridge_ID, Bridge_Name, Year_Built, Length, Sensor_ID, Sensor_Type, Install_Date, Manufacturer, Mfr_Country)

With functional dependencies:

  • \(F_1\): \(\text{Bridge\_ID} \rightarrow \text{Bridge\_Name, Year\_Built, Length}\)
  • \(F_2\): \(\text{Sensor\_ID} \rightarrow \text{Sensor\_Type, Install\_Date, Bridge\_ID, Manufacturer}\)
  • \(F_3\): \(\text{Manufacturer} \rightarrow \text{Mfr\_Country}\)

Key: \(\{\text{Sensor\_ID}\}\) (as we discovered via closure)

Sample data:

Bridge_ID Bridge_Name Year_Built Length Sensor_ID Sensor_Type Install_Date Manufacturer Mfr_Country
B001 Roberto Clemente Bridge 1928 244 S001 accelerometer 2020-01-15 Acme Sensors USA
B001 Roberto Clemente Bridge 1928 244 S002 strain_gauge 2020-01-15 Acme Sensors USA
B001 Roberto Clemente Bridge 1928 244 S003 temperature 2020-02-10 TempTech Germany
B002 Fort Pitt Bridge 1959 1201 S004 accelerometer 2019-12-01 Acme Sensors USA

We’ll progressively normalize this relation through 1NF, 2NF, 3NF, and finally BCNF.

8.2.3.3 First Normal Form (1NF)

Definition: A relation is in First Normal Form (1NF) if:

  1. All attribute values are atomic (indivisible)
  2. Each attribute contains only a single value from its domain (no sets, lists, or repeating groups)
  3. There is a primary key that uniquely identifies each row

What 1NF eliminates: Repeating groups and multi-valued attributes.

Simple Example of 1NF Violation:

Consider a poorly designed Employee relation:

Employee_ID Name Phone_Numbers
E001 Alice 412-555-0100, 412-555-0101
E002 Bob 412-555-0200

The Phone_Numbers column violates 1NF—it contains a list (multiple phone numbers in one cell).

To fix: Create separate rows for each phone number:

Employee_ID Name Phone_Number
E001 Alice 412-555-0100
E001 Alice 412-555-0101
E002 Bob 412-555-0200

Now each cell contains a single atomic value.

Our BridgeSensor Example:

Our BridgeSensor relation is already in 1NF:

  • All values are atomic (single values, not lists)
  • Primary key exists: Sensor_ID

So we can move directly to analyzing higher normal forms. The real action begins with 2NF, where we address partial dependencies.

8.2.3.4 Second Normal Form (2NF)

Definition: A relation is in Second Normal Form (2NF) if:

  1. It is in 1NF, AND
  2. Every non-key attribute is fully functionally dependent on the entire primary key (no partial dependencies)

What is a partial dependency?

A partial dependency occurs when a non-key attribute depends on only part of a composite primary key, not the whole key.

If the primary key is a single attribute, partial dependencies are impossible—the relation is automatically in 2NF (assuming it’s in 1NF).

Modified BridgeSensor for 2NF Demonstration:

To demonstrate 2NF violations, let’s adjust our example slightly. Suppose sensor IDs are only unique within each bridge (Sensor 1 on Bridge A is different from Sensor 1 on Bridge B). Then:

Key: \(\{\text{Bridge\_ID, Sensor\_ID}\}\) (composite key)

BridgeSensor(Bridge_ID, Bridge_Name, Year_Built, Length, Sensor_ID, Sensor_Type, Install_Date, Manufacturer, Mfr_Country)

Functional dependencies:

  • \(\text{Bridge\_ID} \rightarrow \text{Bridge\_Name, Year\_Built, Length}\)Partial! (depends only on part of key)
  • \(\{\text{Bridge\_ID, Sensor\_ID}\} \rightarrow \text{Sensor\_Type, Install\_Date, Manufacturer}\)
  • \(\text{Manufacturer} \rightarrow \text{Mfr\_Country}\)

Problem: Bridge_Name, Year_Built, and Length depend only on Bridge_ID, not on the full key {Bridge_ID, Sensor_ID}. This is a partial dependency, violating 2NF.

Anomalies from 2NF violation:

Looking at the data:

Bridge_ID Bridge_Name Year_Built Length Sensor_ID Sensor_Type Install_Date Manufacturer Mfr_Country
B001 Roberto Clemente Bridge 1928 244 1 accelerometer 2020-01-15 Acme Sensors USA
B001 Roberto Clemente Bridge 1928 244 2 strain_gauge 2020-01-15 Acme Sensors USA
B001 Roberto Clemente Bridge 1928 244 3 temperature 2020-02-10 TempTech Germany

Bridge information repeats for each sensor—classic redundancy from partial dependency.

Decomposition to 2NF:

Remove attributes that depend on only part of the key:

Bridge(Bridge_ID, Bridge_Name, Year_Built, Length)

  • Primary key: Bridge_ID
  • No partial dependencies (key is single attribute)

BridgeSensor(Bridge_ID, Sensor_ID, Sensor_Type, Install_Date, Manufacturer, Mfr_Country)

  • Primary key: {Bridge_ID, Sensor_ID}
  • All non-key attributes depend on the full key

Now the tables:

Bridge:

Bridge_ID Bridge_Name Year_Built Length
B001 Roberto Clemente Bridge 1928 244
B002 Fort Pitt Bridge 1959 1201

BridgeSensor:

Bridge_ID Sensor_ID Sensor_Type Install_Date Manufacturer Mfr_Country
B001 1 accelerometer 2020-01-15 Acme Sensors USA
B001 2 strain_gauge 2020-01-15 Acme Sensors USA
B001 3 temperature 2020-02-10 TempTech Germany
B002 1 accelerometer 2019-12-01 Acme Sensors USA

Benefits of 2NF:

  • Eliminated redundancy: Bridge information appears only once
  • No update anomalies: Changing a bridge’s length requires updating one row
  • No deletion anomalies: Removing all sensors from a bridge doesn’t delete bridge information
  • No insertion anomalies: Can add a bridge without sensors

Note: Relations with single-attribute primary keys (like Bridge above) are automatically in 2NF if they’re in 1NF, since partial dependencies are impossible.

Next problem: BridgeSensor is now in 2NF, but still has issues. Notice Manufacturer appears multiple times, and Mfr_Country is redundant. This is a transitive dependency, which 3NF addresses.

8.2.3.5 Third Normal Form (3NF)

Definition: A relation is in Third Normal Form (3NF) if:

  1. It is in 2NF, AND
  2. No non-key attribute is transitively dependent on the primary key

What is a transitive dependency?

A transitive dependency occurs when: - \(\text{Key} \rightarrow A \rightarrow B\)

In other words, a non-key attribute \(A\) determines another non-key attribute \(B\). The key determines \(B\) indirectly through \(A\) (by transitivity).

Example from BridgeSensor (currently in 2NF):

BridgeSensor(Bridge_ID, Sensor_ID, Sensor_Type, Install_Date, Manufacturer, Mfr_Country)

Functional dependencies:

  • \(\{\text{Bridge\_ID, Sensor\_ID}\} \rightarrow \text{Manufacturer}\) (key determines Manufacturer)
  • \(\text{Manufacturer} \rightarrow \text{Mfr\_Country}\) (non-key determines non-key)

By transitivity: \(\{\text{Bridge\_ID, Sensor\_ID}\} \rightarrow \text{Manufacturer} \rightarrow \text{Mfr\_Country}\)

This is a transitive dependency—the key determines Mfr_Country indirectly through Manufacturer. Violates 3NF!

Problem with transitive dependencies:

Look at the data:

Bridge_ID Sensor_ID Sensor_Type Install_Date Manufacturer Mfr_Country
B001 1 accelerometer 2020-01-15 Acme Sensors USA
B001 2 strain_gauge 2020-01-15 Acme Sensors USA
B001 3 temperature 2020-02-10 TempTech Germany
B002 1 accelerometer 2019-12-01 Acme Sensors USA

Notice “Acme Sensors → USA” appears three times. If Acme Sensors moved to Canada:

  • We’d need to update multiple rows (update anomaly)
  • If we update inconsistently, we have contradictory data

Decomposition to 3NF:

Identify the transitive dependency and split it out:

  1. Identify determinants: Manufacturer is a determinant (determines Mfr_Country)
  2. Create new relation: Make a relation with Manufacturer as the key
  3. Remove dependent attributes: Remove Mfr_Country from BridgeSensor

Sensor(Bridge_ID, Sensor_ID, Sensor_Type, Install_Date, Manufacturer)

  • Primary key: {Bridge_ID, Sensor_ID}
  • Foreign key: Manufacturer references Manufacturer table
  • No transitive dependencies!

Manufacturer(Manufacturer, Mfr_Country)

  • Primary key: Manufacturer
  • Simple relation, no transitive dependencies

Now the tables:

Sensor:

Bridge_ID Sensor_ID Sensor_Type Install_Date Manufacturer
B001 1 accelerometer 2020-01-15 Acme Sensors
B001 2 strain_gauge 2020-01-15 Acme Sensors
B001 3 temperature 2020-02-10 TempTech
B002 1 accelerometer 2019-12-01 Acme Sensors

Manufacturer:

Manufacturer Mfr_Country
Acme Sensors USA
TempTech Germany

Benefits of 3NF:

  • No redundancy from transitive dependencies: Each manufacturer’s country appears once
  • No update anomalies: Changing Acme Sensors’ country requires one update
  • No deletion anomalies: Deleting all sensors from a manufacturer doesn’t delete manufacturer info
  • Better data integrity: Can’t have inconsistent countries for the same manufacturer

Summary so far:

We now have three relations, all in 3NF:

  1. Bridge(Bridge_ID, Bridge_Name, Year_Built, Length)
  2. Sensor(Bridge_ID, Sensor_ID, Sensor_Type, Install_Date, Manufacturer)
  3. Manufacturer(Manufacturer, Mfr_Country)

Are we done? For most practical purposes, yes—3NF is often good enough. But there’s one more normal form worth knowing: BCNF, which handles a subtle edge case.

8.2.3.6 Boyce-Codd Normal Form (BCNF)

Definition: A relation is in Boyce-Codd Normal Form (BCNF) if:

For every nontrivial functional dependency \(\bar{A} \rightarrow \bar{B}\), \(\bar{A}\) is a superkey (contains a candidate key).

In simpler terms: Every determinant must be a candidate key.

BCNF is a stricter version of 3NF. While 3NF says “no non-key attribute determines another non-key attribute,” BCNF says “ONLY keys can be determinants.”

Relationship to 3NF:

  • Every relation in BCNF is also in 3NF
  • Most relations in 3NF are also in BCNF
  • BCNF violations occur in special cases, typically when there are overlapping candidate keys

When does BCNF differ from 3NF?

BCNF catches cases where a non-key attribute determines a key attribute (or part of a key). This doesn’t violate 3NF (which only restricts non-key → non-key dependencies), but it still causes anomalies.

Example: Bridge Inspection Schedule

Consider a relation tracking bridge inspections:

Inspection(Inspector, Bridge, Bridge_Type, Inspection_Date)

Business rules:

  1. Each inspector specializes in exactly one bridge type (arch, suspension, truss)
  2. Each bridge has exactly one type
  3. Each inspector inspects each bridge at most once

Functional dependencies:

  • \(\text{Inspector} \rightarrow \text{Bridge\_Type}\) (each inspector specializes in one type)
  • \(\text{Bridge} \rightarrow \text{Bridge\_Type}\) (each bridge has one type)
  • \(\{\text{Inspector, Bridge}\} \rightarrow \text{Inspection\_Date}\) (key determines date)

Keys: \(\{\text{Inspector, Bridge}\}\) is the only candidate key

Sample data:

Inspector Bridge Bridge_Type Inspection_Date
Alice B001 suspension 2024-01-15
Alice B002 suspension 2024-02-10
Bob B003 arch 2024-01-20
Bob B004 arch 2024-03-05
Carol B005 truss 2024-02-01

Is this in 3NF?

Check for transitive dependencies (non-key → non-key):

  • Bridge_Type is determined by Inspector (both are part of different candidate keys or non-key)
  • But wait—Inspector and Bridge are both part of the primary key!
  • No non-key attribute determines another non-key attribute

Yes, this is in 3NF.

Is this in BCNF?

Check if every determinant is a superkey:

  • \(\text{Inspector} \rightarrow \text{Bridge\_Type}\), but Inspector alone is NOT a superkey
  • Inspector is a determinant but not a candidate key!

No, this violates BCNF.

Problem—Anomalies despite being in 3NF:

Look at the data again:

Inspector Bridge Bridge_Type Inspection_Date
Alice B001 suspension 2024-01-15
Alice B002 suspension 2024-02-10

Bridge_Type “suspension” is repeated for every bridge Alice inspects.

  • Update anomaly: If Alice switches specializations from suspension to arch, we must update every row where she appears
  • Insertion anomaly: We can’t record that a new inspector specializes in truss bridges until they inspect a bridge
  • Deletion anomaly: If Bob stops inspecting B003 and B004, we lose the information that Bob specializes in arch bridges

Decomposition to BCNF:

Split based on the problematic FD \(\text{Inspector} \rightarrow \text{Bridge\_Type}\):

InspectorSpecialty(Inspector, Bridge_Type)

  • Primary key: Inspector
  • FD: \(\text{Inspector} \rightarrow \text{Bridge\_Type}\) ✓ (determinant IS a key)

BridgeInspection(Inspector, Bridge, Inspection_Date)

  • Primary key: {Inspector, Bridge}
  • FD: \(\{\text{Inspector, Bridge}\} \rightarrow \text{Inspection\_Date}\) ✓ (determinant IS a key)

Now the tables:

InspectorSpecialty:

Inspector Bridge_Type
Alice suspension
Bob arch
Carol truss

BridgeInspection:

Inspector Bridge Inspection_Date
Alice B001 2024-01-15
Alice B002 2024-02-10
Bob B003 2024-01-20
Bob B004 2024-03-05
Carol B005 2024-02-01

Both relations are now in BCNF. Every determinant is a candidate key.

Benefits of BCNF:

  • No redundancy: Each inspector’s specialty appears once
  • No anomalies: All the update/insertion/deletion anomalies are eliminated
  • Stronger guarantee: BCNF is the “gold standard” for functional dependency-based normal forms

Note: We also need the FD \(\text{Bridge} \rightarrow \text{Bridge\_Type}\) to be enforced. We’d need:

BridgeType(Bridge, Bridge_Type)

Making our final BCNF schema:

  1. InspectorSpecialty(Inspector, Bridge_Type)
  2. BridgeInspection(Inspector, Bridge, Inspection_Date)
  3. BridgeType(Bridge, Bridge_Type)

When to stop at 3NF vs. push to BCNF:

  • BCNF is the theoretical ideal: No anomalies from functional dependencies

  • 3NF is sometimes preferred when achieving BCNF would:

    • Require too many joins for common queries (performance cost)
    • Lose dependency preservation (can’t enforce all original FDs with just the decomposed tables)

For most applications, BCNF is the target. But there are legitimate cases where 3NF is good enough.

8.2.3.7 Higher Normal Forms

BCNF handles all anomalies arising from functional dependencies. But there are other types of dependencies that cause anomalies, leading to even higher normal forms.

Fourth Normal Form (4NF):

4NF addresses multivalued dependencies (MVDs)—situations where one attribute determines a set of values for another attribute, independently of other attributes.

Example: Suppose an instructor teaches multiple courses and has multiple phone numbers, independently:

Instructor Course Phone
Dr. Smith DB Design 555-0100
Dr. Smith DB Design 555-0101
Dr. Smith Data Mining 555-0100
Dr. Smith Data Mining 555-0101

Every combination appears, creating redundancy. The multivalued dependency is:

\[\text{Instructor} \twoheadrightarrow \text{Course}\] \[\text{Instructor} \twoheadrightarrow \text{Phone}\]

To achieve 4NF, split into two relations:

  • InstructorCourse(Instructor, Course)
  • InstructorPhone(Instructor, Phone)

Fifth Normal Form (5NF) and Beyond:

5NF (also called Projection-Join Normal Form) deals with join dependencies—complex cases where information can only be reconstructed by joining three or more tables.

Higher normal forms (6NF, Domain-Key Normal Form) exist but are primarily of theoretical interest. They’re rarely used in practice because:

  1. BCNF eliminates virtually all practical anomalies in most databases
  2. Higher normal forms can require excessive joins, hurting query performance
  3. The additional complexity rarely justifies the marginal benefit

Practical Advice:

Don’t assume the highest level of normalization is always the most desirable.

  • BCNF is the sweet spot for most applications—strong guarantees without excessive complexity

  • Higher normal forms require more joins, which can slow query performance

  • Denormalization (intentionally violating normal forms) is sometimes appropriate for:

    • Read-heavy applications where query speed matters more than update efficiency
    • Data warehouses optimized for analytics
    • Caching frequently accessed aggregated data

The key is understanding the tradeoffs and making informed design decisions based on your specific requirements.