8 Module 7: Database Design
8.1 Module Overview
8.1.1 Learning Objectives
By the end of this module, students will be able to:
- Explain how relational databases are designed and connect the theory behind it to earlier concepts such as Entity Relationship Diagrams and Relational Algebra
- Explain how the properties of the data lead to certain dependencies that allow for it to be structured, and understand how different structures lead to different design choices named “normal forms”
- Decompose a relation up to its Boyce-Codd Normal Form (BCNF), when possible
- Derive functional dependencies given properties of the data, and draw these dependencies in a diagram
- Understand the concept of closures and keys of relations
8.1.2 Topics Covered
- Principles of database design
- Properties and normal forms
- Functional dependencies
- Keys and closures
8.1.3 Project Milestones
Meet with your team to agree on a set of milestones and deliverables that are needed to complete the final project. This will be required as part of Assignment 4.
8.2 Lecture Notes
8.2.1 Why do we need to design a database?
In Modules 5 and 6, we learned the relational model and SQL—giving us the tools to create databases and query them. We even saw mechanical rules for translating ER diagrams into relational schemas. So we’re done, right? We can design databases!
Not quite. While you can create a database from an ER diagram, the question remains: Is your design any good?
Different designs can store the same information, but some designs lead to:
- Wasted storage from redundant data
- Inconsistencies when data is updated incorrectly
- Lost information when rows are deleted
- Maintenance nightmares as the database evolves
Database design theory helps us create schemas that avoid these problems. The good news is there’s elegant mathematical theory—based on functional dependencies and normal forms—that guides us toward better designs, often automatically.
8.2.1.1 A Poorly Designed Database
Let’s see what can go wrong. Suppose we want to track our bridge monitoring system, and we decide to store everything in one big table:
BridgeSensor (poorly designed):
| Bridge_ID | Bridge_Name | Year_Built | Length | Sensor_ID | Sensor_Type | Install_Date | Manufacturer | Mfr_Country |
|---|---|---|---|---|---|---|---|---|
| B001 | Roberto Clemente Bridge | 1928 | 244 | S001 | accelerometer | 2020-01-15 | Acme Sensors | USA |
| B001 | Roberto Clemente Bridge | 1928 | 244 | S002 | strain_gauge | 2020-01-15 | Acme Sensors | USA |
| B001 | Roberto Clemente Bridge | 1928 | 244 | S003 | temperature | 2020-02-10 | TempTech | Germany |
| B002 | Fort Pitt Bridge | 1959 | 1201 | S004 | accelerometer | 2019-12-01 | Acme Sensors | USA |
| B002 | Fort Pitt Bridge | 1959 | 1201 | S005 | strain_gauge | 2019-12-01 | SensorCorp | Japan |
This design works—we can store data and query it. But look closely at the problems:
Problem 1: Redundancy
Notice how “Roberto Clemente Bridge”, “1928”, and “244” appear three times—once for each sensor on bridge B001. Similarly, “Acme Sensors” and “USA” appear three times. We’re storing the same information repeatedly, wasting space.
Problem 2: Update Anomalies
Suppose we discover the Roberto Clemente Bridge is actually 245 meters long, not 244. We must update three rows. If we forget to update one row, our database becomes inconsistent—some rows say 244, others say 245. Which is correct?
Problem 3: Deletion Anomalies
What if we remove all sensors from the Fort Pitt Bridge (maybe for maintenance)? We’d delete both rows with Bridge_ID = B002. But now we’ve lost all information about that bridge—we no longer know it exists, when it was built, or its length. The sensor data deletion accidentally deleted bridge data too.
Problem 4: Insertion Anomalies
Suppose we build a new bridge but haven’t installed sensors yet. We can’t add a row to BridgeSensor because we don’t have sensor information to fill in. We’re forced to either wait until sensors are installed, or insert NULL values for sensor columns, or create dummy sensor entries—all unsatisfying workarounds.
8.2.1.2 A Better Design: Decomposition
Now let’s decompose this into separate tables following proper design principles:
Bridge:
| Bridge_ID | Bridge_Name | Year_Built | Length |
|---|---|---|---|
| B001 | Roberto Clemente Bridge | 1928 | 244 |
| B002 | Fort Pitt Bridge | 1959 | 1201 |
| B003 | Fort Duquesne Bridge | 1969 | 1201 |
Sensor:
| Sensor_ID | Sensor_Type | Install_Date | Bridge_ID | Manufacturer_ID |
|---|---|---|---|---|
| S001 | accelerometer | 2020-01-15 | B001 | M001 |
| S002 | strain_gauge | 2020-01-15 | B001 | M001 |
| S003 | temperature | 2020-02-10 | B001 | M002 |
| S004 | accelerometer | 2019-12-01 | B002 | M001 |
| S005 | strain_gauge | 2019-12-01 | B002 | M003 |
Manufacturer:
| Manufacturer_ID | Manufacturer | Mfr_Country |
|---|---|---|
| M001 | Acme Sensors | USA |
| M002 | TempTech | Germany |
| M003 | SensorCorp | Japan |
How decomposition solves the problems:
No redundancy: Each bridge’s information appears exactly once. Each manufacturer’s information appears exactly once.
No update anomalies: To change a bridge’s length, we update exactly one row in the Bridge table. Impossible to create inconsistency.
No deletion anomalies: Removing all sensors from a bridge (deleting rows from Sensor table) doesn’t affect the Bridge table. Bridge information persists.
No insertion anomalies: We can add a new bridge to the Bridge table without having any sensors. We can add a new manufacturer without having sensors from them yet.
8.2.1.3 Design Theory to the Rescue
The decomposition above looks obviously better. But how do we systematically arrive at such designs? How do we know we’re done decomposing? Could we decompose further?
This is where database design theory comes in:
- Functional dependencies capture properties of the data (e.g., “Bridge_ID determines Bridge_Name”)
- Normal forms (1NF, 2NF, 3NF, BCNF) provide target structures that eliminate anomalies
- Decomposition algorithms automatically break down mega-relations into well-designed tables
The rest of this module develops this theory systematically. We’ll learn how to: 1. Identify functional dependencies from data properties 2. Recognize when a schema violates normal forms 3. Decompose relations to achieve higher normal forms (usually BCNF) 4. Verify that our decomposition preserves all information
By the end, you’ll have principled techniques for designing databases that are efficient, consistent, and maintainable.
8.2.2 Database design theory
How do we systematically design good databases? Database design theory provides a principled approach based on decomposition.
8.2.2.1 The Decomposition Approach
The core idea is simple but powerful:
- Start with a “mega-relation” containing all attributes in a single table
- Identify properties of the data (functional dependencies)
- Decompose systematically into smaller relations based on these properties
- Achieve a normal form that guarantees no anomalies and no information loss
This process can be done algorithmically—given a set of functional dependencies, we can automatically decompose a relation into a well-designed schema.
8.2.2.2 The Decomposition Process
Here’s the general workflow:
INPUT:
- A relation \(R\) with all attributes (the “mega-relation”)
- Functional dependencies describing properties of the data (e.g., “Bridge_ID determines Bridge_Name”)
- Optionally: multivalued dependencies for more complex cases
PROCEDURE:
- Analyze functional dependencies to identify problems (violations of normal forms)
- Systematically decompose \(R\) into smaller relations \(R_1, R_2, \ldots, R_n\)
- Each decomposition step removes one source of anomalies
OUTPUT:
A set of relations in a target normal form (typically BCNF or 3NF)
Guarantees:
- No anomalies: No redundancy, no update/deletion/insertion anomalies
- Lossless decomposition: We can reconstruct the original information by joining the decomposed relations
- Dependency preservation (when possible): All original constraints are still enforceable
8.2.2.3 Normal Forms: Quality Levels for Designs
Normal forms are increasingly strict criteria for database design. Think of them as quality levels:
- First Normal Form (1NF): Eliminate repeating groups and ensure atomic values
- Second Normal Form (2NF): Eliminate partial dependencies on composite keys
- Third Normal Form (3NF): Eliminate transitive dependencies
- Boyce-Codd Normal Form (BCNF): Every determinant must be a candidate key
Each normal form eliminates certain types of anomalies. Higher normal forms are “better” but sometimes require tradeoffs. BCNF is the sweet spot for most applications—it eliminates all anomalies arising from functional dependencies.
8.2.2.4 Why This Matters
The beautiful thing about this theory is that it’s automatic and systematic:
- You don’t need to guess what tables to create
- You don’t rely on intuition about what “feels right”
- The math tells you exactly how to decompose
This doesn’t mean design is mechanical—you still need to:
- Understand your domain well enough to identify functional dependencies
- Sometimes make tradeoffs between normal forms and query performance
- Recognize when denormalization makes sense for specific use cases
But the theory gives you a solid foundation and a principled starting point.
8.2.2.5 Roadmap for This Module
To understand and apply this theory, we’ll proceed as follows:
- Functional dependencies: Learn what they are, how to identify them, and rules for reasoning about them
- Keys and closures: Understand how functional dependencies relate to keys
- Normal forms: Study 1NF, 2NF, 3NF, and BCNF in detail with examples
- Decomposition algorithms: See how to systematically achieve normal forms
Let’s start with functional dependencies—the foundation of everything else.
8.2.2.6 Functional Dependencies
A functional dependency (FD) is a constraint that describes a relationship between attributes. It captures the idea that knowing the value of certain attributes allows us to uniquely determine the value of other attributes.
Formal definition: Given a relation \(R\), a functional dependency \(\bar{A} \rightarrow \bar{B}\) holds if:
Whenever two tuples in \(R\) agree on all attributes in \(\bar{A}\), they must also agree on all attributes in \(\bar{B}\).
In other words, the values of attributes in \(\bar{A}\) determine or functionally determine the values of attributes in \(\bar{B}\).
Notation:
- \(\bar{A}\) denotes a set of attributes \(\{A_1, A_2, \ldots, A_n\}\)
- \(\bar{A} \rightarrow \bar{B}\) means “\(\bar{A}\) functionally determines \(\bar{B}\)”
- We say “\(\bar{A}\) is a determinant” and “\(\bar{B}\) depends on \(\bar{A}\)”
8.2.2.7 Examples from the Bridge Monitoring System
Let’s identify functional dependencies in our BridgeSensor mega-relation:
BridgeSensor(Bridge_ID, Bridge_Name, Year_Built, Length, Sensor_ID, Sensor_Type, Install_Date, Manufacturer, Mfr_Country)
FD 1: \(\text{Bridge\_ID} \rightarrow \text{Bridge\_Name, Year\_Built, Length}\)
If two rows have the same Bridge_ID, they must have the same Bridge_Name, Year_Built, and Length.
This makes sense: a bridge ID uniquely identifies a bridge, so all properties of that bridge are determined by the ID.
FD 2: \(\text{Sensor\_ID} \rightarrow \text{Sensor\_Type, Install\_Date, Bridge\_ID, Manufacturer}\)
If two rows have the same Sensor_ID, they must have the same Sensor_Type, Install_Date, Bridge_ID, and Manufacturer.
A sensor ID uniquely identifies a sensor and all its properties, including which bridge it’s on and who manufactured it.
FD 3: \(\text{Manufacturer} \rightarrow \text{Mfr\_Country}\)
If two rows have the same Manufacturer, they must have the same Mfr_Country.
A manufacturer’s name determines where they’re located. (We’re assuming manufacturer names are unique—if not, we’d need a Manufacturer_ID.)
Non-example: We do NOT have \(\text{Year\_Built} \rightarrow \text{Bridge\_Name}\)
Multiple bridges can be built in the same year, so knowing the year doesn’t tell us which bridge. This is not a functional dependency.
8.2.2.8 Visualizing Functional Dependencies
We can draw functional dependency diagrams to show the structure. For BridgeSensor:
Bridge_ID → Bridge_Name, Year_Built, Length
Sensor_ID → Sensor_Type, Install_Date, Bridge_ID, Manufacturer
Manufacturer → Mfr_Country
These dependencies reveal the hidden structure in our data. Notice that Sensor_ID determines Bridge_ID, and Bridge_ID in turn determines bridge properties. By transitivity, Sensor_ID determines all attributes in the relation—making it a candidate key!
8.2.2.9 Types of Functional Dependencies
Given a functional dependency \(\bar{A} \rightarrow \bar{B}\):
Trivial FD: \(\bar{B} \subseteq \bar{A}\)
Every attribute on the right side is also on the left side.
Example: \(\{\text{Bridge\_ID, Bridge\_Name}\} \rightarrow \text{Bridge\_ID}\)
This is always true (trivially) because if two tuples agree on Bridge_ID and Bridge_Name, they obviously agree on Bridge_ID alone. Trivial FDs are not interesting—they don’t tell us anything useful.
Nontrivial FD: \(\bar{B} \not\subseteq \bar{A}\)
At least one attribute on the right side is not on the left side.
Example: \(\text{Bridge\_ID} \rightarrow \{\text{Bridge\_Name, Bridge\_ID}\}\)
This is nontrivial because Bridge_Name is not in the left side.
Completely nontrivial FD: \(\bar{A} \cap \bar{B} = \emptyset\)
No attribute appears on both sides.
Example: \(\text{Bridge\_ID} \rightarrow \text{Bridge\_Name, Year\_Built}\)
This is completely nontrivial—the sets are disjoint.
Why this matters: Completely nontrivial FDs are the most interesting for database design. They reveal true dependencies between different attributes.
8.2.2.10 Rules for Manipulating Functional Dependencies
Functional dependencies obey certain algebraic rules. These rules let us derive new FDs from existing ones.
Splitting Rule (Right Side):
If \(\bar{A} \rightarrow \{B_1, B_2, \ldots, B_m\}\), then we can split this into multiple FDs: \[\bar{A} \rightarrow B_1, \quad \bar{A} \rightarrow B_2, \quad \ldots, \quad \bar{A} \rightarrow B_m\]
Example: From \(\text{Bridge\_ID} \rightarrow \{\text{Bridge\_Name, Year\_Built, Length}\}\), we get:
- \(\text{Bridge\_ID} \rightarrow \text{Bridge\_Name}\)
- \(\text{Bridge\_ID} \rightarrow \text{Year\_Built}\)
- \(\text{Bridge\_ID} \rightarrow \text{Length}\)
Combining Rule (Right Side):
Conversely, if we have multiple FDs with the same left side, we can combine them:
If \(\bar{A} \rightarrow B_1\) and \(\bar{A} \rightarrow B_2\), … and \(\bar{A} \rightarrow B_m\), then: \[\bar{A} \rightarrow \{B_1, B_2, \ldots, B_m\}\]
Important: We cannot split or combine the left side arbitrarily. The FD \(\{A_1, A_2\} \rightarrow B\) is NOT equivalent to \(A_1 \rightarrow B\) and \(A_2 \rightarrow B\).
8.2.2.11 Functional Dependencies and Keys
Keys are intimately connected to functional dependencies. In fact, a key is a set of attributes whose functional dependency extends to all attributes in the relation.
Definition: A set of attributes \(\bar{K}\) is a key for relation \(R\) if:
- \(\bar{K} \rightarrow \text{all attributes of } R\) (the key determines everything)
- No proper subset of \(\bar{K}\) has property 1 (minimality—can’t remove any attribute from \(\bar{K}\) and still have a key)
The first property means that knowing the values of key attributes uniquely identifies a tuple—no two tuples can have the same key values. The second property means the key contains no redundant attributes.
Example: In our BridgeSensor relation:
\(\{\text{Sensor\_ID}\} \rightarrow \text{all attributes}\)
This is a key because:
- Sensor_ID determines all nine attributes in the relation (directly and via transitivity through Bridge_ID and Manufacturer)
- It’s a single attribute, so it’s automatically minimal
Non-example: \(\{\text{Bridge\_ID, Sensor\_ID}\}\) also determines all attributes, but it’s not a key because it’s not minimal—we can remove Bridge_ID and still have a key (just Sensor_ID).
Multiple Candidate Keys:
A relation can have multiple keys. Each is called a candidate key. We typically choose one as the primary key for implementation.
Example: Consider a simpler relation:
Student(Student_ID, Email, Name, GPA)
With functional dependencies:
- \(\text{Student\_ID} \rightarrow \text{Email, Name, GPA}\)
- \(\text{Email} \rightarrow \text{Student\_ID, Name, GPA}\)
Both Student_ID and Email are candidate keys—either one uniquely identifies a student. We’d choose one (say Student_ID) as the primary key.
8.2.2.12 Inference Rules for Functional Dependencies
Beyond splitting and combining, there are several other important rules for deriving new functional dependencies from existing ones. These rules are called Armstrong’s Axioms and form a complete system for reasoning about FDs.
Reflexivity (Trivial FD Rule):
If \(\bar{B} \subseteq \bar{A}\), then \(\bar{A} \rightarrow \bar{B}\)
Any set of attributes determines its subsets.
Example: \(\{\text{Bridge\_ID, Bridge\_Name}\} \rightarrow \text{Bridge\_ID}\) (trivially true)
This captures trivial FDs—they’re always true but not informative.
Augmentation:
If \(\bar{A} \rightarrow \bar{B}\), then \(\{\bar{A}, \bar{C}\} \rightarrow \{\bar{B}, \bar{C}\}\) for any set \(\bar{C}\)
If \(\bar{A}\) determines \(\bar{B}\), then adding attributes to both sides preserves the dependency.
Example: From \(\text{Bridge\_ID} \rightarrow \text{Bridge\_Name}\), we get: \[\{\text{Bridge\_ID, Sensor\_ID}\} \rightarrow \{\text{Bridge\_Name, Sensor\_ID}\}\]
Transitivity:
If \(\bar{A} \rightarrow \bar{B}\) and \(\bar{B} \rightarrow \bar{C}\), then \(\bar{A} \rightarrow \bar{C}\)
Dependencies chain together.
Example: Suppose we have:
- \(\text{Sensor\_ID} \rightarrow \text{Bridge\_ID}\)
- \(\text{Bridge\_ID} \rightarrow \text{Year\_Built}\)
Then by transitivity: \(\text{Sensor\_ID} \rightarrow \text{Year\_Built}\)
This makes intuitive sense: if a sensor ID determines which bridge, and the bridge determines when it was built, then the sensor ID determines when the bridge was built.
Derived Rules:
From Armstrong’s Axioms, we can derive additional useful rules:
Union: If \(\bar{A} \rightarrow \bar{B}\) and \(\bar{A} \rightarrow \bar{C}\), then \(\bar{A} \rightarrow \{\bar{B}, \bar{C}\}\)
Decomposition: If \(\bar{A} \rightarrow \{\bar{B}, \bar{C}\}\), then \(\bar{A} \rightarrow \bar{B}\) and \(\bar{A} \rightarrow \bar{C}\)
Pseudotransitivity: If \(\bar{A} \rightarrow \bar{B}\) and \(\{\bar{B}, \bar{C}\} \rightarrow \bar{D}\), then \(\{\bar{A}, \bar{C}\} \rightarrow \bar{D}\)
These rules are all you need to derive any FD that logically follows from a given set of FDs. This completeness property is mathematically elegant—Armstrong’s Axioms are both sound (only derive true FDs) and complete (can derive all true FDs).
8.2.2.13 Closures
The closure of a set of attributes is a powerful tool for answering questions about functional dependencies. It tells us all attributes that can be determined from a given set.
Definition: Given a set of functional dependencies \(F\) and a set of attributes \(\bar{A}\), the closure of \(\bar{A}\) (denoted \(\bar{A}^+\)) is:
The set of all attributes \(B\) such that \(\bar{A} \rightarrow B\) can be inferred from \(F\) using Armstrong’s Axioms.
In other words, \(\bar{A}^+\) contains all attributes functionally determined by \(\bar{A}\).
Computing the Closure:
Here’s the algorithm for computing \(\bar{A}^+\) given a set of FDs:
- Start with \(\bar{A}^+ = \bar{A}\) (the closure includes the starting attributes)
- Repeat until no more attributes can be added:
- For each FD \(\bar{X} \rightarrow \bar{Y}\) in \(F\):
- If \(\bar{X} \subseteq \bar{A}^+\) (all attributes in \(\bar{X}\) are already in the closure)
- Then add \(\bar{Y}\) to \(\bar{A}^+\) (we can now determine \(\bar{Y}\) too)
- For each FD \(\bar{X} \rightarrow \bar{Y}\) in \(F\):
- Return \(\bar{A}^+\)
Example: Consider the BridgeSensor relation with these FDs:
- \(F_1\): \(\text{Bridge\_ID} \rightarrow \text{Bridge\_Name, Year\_Built, Length}\)
- \(F_2\): \(\text{Sensor\_ID} \rightarrow \text{Sensor\_Type, Install\_Date, Bridge\_ID, Manufacturer}\)
- \(F_3\): \(\text{Manufacturer} \rightarrow \text{Mfr\_Country}\)
Compute: \(\{\text{Sensor\_ID}\}^+\)
Step 1: Start with \(\{\text{Sensor\_ID}\}\)
Step 2: Apply \(F_2\) since Sensor_ID is in our closure:
- Add Sensor_Type, Install_Date, Bridge_ID, Manufacturer
- Now: \(\{\text{Sensor\_ID, Sensor\_Type, Install\_Date, Bridge\_ID, Manufacturer}\}\)
Step 3: Apply \(F_1\) since Bridge_ID is now in our closure:
- Add Bridge_Name, Year_Built, Length
- Now: \(\{\text{Sensor\_ID, Sensor\_Type, Install\_Date, Bridge\_ID, Manufacturer, Bridge\_Name, Year\_Built, Length}\}\)
Step 4: Apply \(F_3\) since Manufacturer is another attribute with an associated FD that we added to our closure in Step 2:
- Add Mfr_Country
- Now: \(\{\text{Sensor\_ID, Sensor\_Type, Install\_Date, Bridge\_ID, Manufacturer, Bridge\_Name, Year\_Built, Length, Mfr\_Country}\}\)
Step 5: No more FDs apply. Done!
Result: \(\{\text{Sensor\_ID}\}^+ = \{\text{Sensor\_ID, Sensor\_Type, Install\_Date, Bridge\_ID, Manufacturer, Bridge\_Name, Year\_Built, Length, Mfr\_Country}\}\)
This tells us: knowing the Sensor_ID, we can determine ALL attributes in the relation—not just sensor properties, but also which bridge it’s on, all that bridge’s properties, the manufacturer, and the manufacturer’s country (all via transitivity). This confirms that Sensor_ID is a superkey (and in fact, a key).
Another Example: \(\{\text{Bridge\_ID}\}^+\)
Step 1: Start with \(\{\text{Bridge\_ID}\}\)
Step 2: Apply \(F_1\):
- Add Bridge_Name, Year_Built, Length
- Now: \(\{\text{Bridge\_ID, Bridge\_Name, Year\_Built, Length}\}\)
Step 3: Can’t apply \(F_2\) (don’t have Sensor_ID) or \(F_3\) (don’t have Manufacturer)
Result: \(\{\text{Bridge\_ID}\}^+ = \{\text{Bridge\_ID, Bridge\_Name, Year\_Built, Length}\}\)
Bridge_ID only determines bridge properties, not sensor or manufacturer properties.
8.2.2.14 Closures and Keys
Closures provide a simple test for whether a set of attributes is a key.
Key Test Using Closure:
A set of attributes \(\bar{A}\) is a superkey for relation \(R\) if and only if \(\bar{A}^+\) contains all attributes of \(R\).
A superkey is a key if no proper subset of \(\bar{A}\) is also a superkey (minimality condition).
Example: Is \(\{\text{Sensor\_ID}\}\) a key for BridgeSensor?
From our earlier calculation: \(\{\text{Sensor\_ID}\}^+ = \{\text{Sensor\_ID, Sensor\_Type, Install\_Date, Bridge\_ID, Manufacturer, Bridge\_Name, Year\_Built, Length, Mfr\_Country}\}\)$
This is all 9 attributes! Therefore, \(\{\text{Sensor\_ID}\}\) is a superkey.
Is it a key? Since it’s a single attribute, it’s automatically minimal—we can’t remove any attributes from it. Therefore, \(\{\text{Sensor\_ID}\}\) is a key for BridgeSensor.
Example: Is \(\{\text{Bridge\_ID, Sensor\_ID}\}\) a key?
Let’s compute the closure:
Start: \(\{\text{Bridge\_ID, Sensor\_ID}\}\)
Apply \(F_1\) (Bridge_ID → Bridge_Name, Year_Built, Length):
- Add: Bridge_Name, Year_Built, Length
Apply \(F_2\) (Sensor_ID → Sensor_Type, Install_Date, Bridge_ID, Manufacturer):
- Add: Sensor_Type, Install_Date, Manufacturer (Bridge_ID already there)
Apply \(F_3\) (Manufacturer → Mfr_Country):
- Add: Mfr_Country
Result: All 9 attributes! So \(\{\text{Bridge\_ID, Sensor\_ID}\}\) is a superkey.
But is it a key? Check if any proper subset is also a superkey:
\(\{\text{Bridge\_ID}\}^+ = \{\text{Bridge\_ID, Bridge\_Name, Year\_Built, Length}\}\) — not all attributes
\(\{\text{Sensor\_ID}\}^+\) results in all attributes as we have seen before.
Since \(\{\text{Sensor\_ID}\}\) is a superkey and a proper subset of \(\{\text{Bridge\_ID, Sensor\_ID}\}\), the composite is not a key (it violates minimality). The key for BridgeSensor is just Sensor_ID—each row represents one sensor, and sensor IDs are unique.
Finding All Keys Systematically:
How do we find all candidate keys for a relation?
Naive approach: Test every subset of attributes. For \(n\) attributes, that’s \(2^n\) subsets—exponential and impractical for large relations.
Better approach: Use the minimality property
- Start by testing single attributes: If any attribute \(A\) has \(\{A\}^+ = \text{all attributes}\), then \(\{A\}\) is a key
- Test pairs of attributes: \(\{A, B\}\)
- Test triples: \(\{A, B, C\}\)
- And so on…
Key optimization: If \(\bar{C}\) is a key, then any superset \(\{\bar{C}, D\}\) is a superkey but not a key (violates minimality). So once we find a key of size \(k\), we don’t need to test any supersets of that key.
Example: For BridgeSensor with 9 attributes, we’d test:
- 9 single-attribute sets
- If none are keys, test \(\binom{9}{2} = 36\) pairs
- And so on…
In practice, domain knowledge often tells us what the keys are. But closure computation gives us a mechanical procedure when we’re unsure.
8.2.2.15 Specifying Functional Dependencies for a Relation
When designing a database, we need to specify which functional dependencies hold. But there are infinitely many true FDs (including all the trivial ones). How do we choose which ones to write down?
Logical Implication:
We say a set of FDs \(S_2\) follows from (or is implied by) a set \(S_1\) if:
Every relation instance that satisfies all FDs in \(S_1\) also satisfies all FDs in \(S_2\).
In other words, \(S_1\) logically entails \(S_2\).
Example:
Given: \(S_1 = \{\text{Sensor\_ID} \rightarrow \text{Bridge\_ID}, \text{Bridge\_ID} \rightarrow \text{Year\_Built}\}\)
Does \(S_1\) imply \(\text{Sensor\_ID} \rightarrow \text{Year\_Built}\)?
Yes! By transitivity. Any relation satisfying \(S_1\) must also satisfy \(\text{Sensor\_ID} \rightarrow \text{Year\_Built}\).
Testing If an FD Follows:
To test whether \(\bar{A} \rightarrow \bar{B}\) follows from a set \(S\) of FDs:
- Compute \(\bar{A}^+\) using the FDs in \(S\)
- If \(\bar{B} \subseteq \bar{A}^+\), then \(\bar{A} \rightarrow \bar{B}\) follows from \(S\)
- Otherwise, it doesn’t
Example:
Given FDs:
- \(\text{Bridge\_ID} \rightarrow \text{Bridge\_Name, Year\_Built, Length}\)
- \(\text{Sensor\_ID} \rightarrow \text{Bridge\_ID, Sensor\_Type}\)
Does \(\text{Sensor\_ID} \rightarrow \text{Year\_Built}\) follow?
Compute \(\{\text{Sensor\_ID}\}^+\):
- Start: Sensor_ID
- Apply second FD: Add Bridge_ID, Sensor_Type
- Apply first FD: Add Bridge_Name, Year_Built, Length
- Result: {Sensor_ID, Bridge_ID, Sensor_Type, Bridge_Name, Year_Built, Length}
Since Year_Built is in the closure, yes, \(\text{Sensor\_ID} \rightarrow \text{Year\_Built}\) follows.
Minimal Basis (Canonical Cover):
Our goal when specifying FDs is to find a minimal set that captures all the dependencies without redundancy.
A set of FDs \(S\) is minimal (or a canonical cover) if:
- All FDs are completely nontrivial (right side doesn’t overlap with left side)
- No FD in \(S\) can be derived from the others (no redundancy)
- No attribute in any left side is redundant (can’t make any left side smaller)
Example of redundant FDs:
Consider:
- \(\text{A} \rightarrow \text{B}\)
- \(\text{B} \rightarrow \text{C}\)
- \(\text{A} \rightarrow \text{C}\)
The third FD is redundant—it follows from the first two by transitivity. A minimal set would omit it.
Why minimality matters:
- Clarity: Easier to understand the essential dependencies
- Efficiency: Fewer constraints to check when inserting/updating data
- Design: Minimal FD sets clearly reveal the structure needed for normalization
In practice, you identify FDs based on domain knowledge (understanding what real-world properties constrain the data), then optionally simplify to a minimal basis. The minimal basis is what you’ll use when decomposing relations.
8.2.3 Normal Forms
Now that we understand functional dependencies, we can study normal forms—quality criteria for database design. Each normal form eliminates certain types of anomalies by restricting what functional dependencies are allowed.
Normal forms form a hierarchy:
\[\text{1NF} \subset \text{2NF} \subset \text{3NF} \subset \text{BCNF}\]
Every relation in 2NF is also in 1NF. Every relation in 3NF is also in 2NF. And so on. Higher normal forms are “stricter” and eliminate more anomalies.
8.2.3.1 Why Normalization Matters
Unnormalized databases suffer from several problems:
- Anomalies: Redundancy leads to update, deletion, and insertion anomalies (as we saw earlier)
- Inefficient updates: Programs must update multiple copies of the same data
- Complex indexing: Indexing large, redundant tables is cumbersome
- Wasted storage: Repeated data consumes unnecessary space
- Maintenance nightmares: Changes to schema or constraints require updating many places
Normalization systematically eliminates these problems by decomposing relations according to their functional dependencies.
Important notes:
- Normalization should be part of the design process, not an afterthought
- The normalization process may reveal additional entities and relationships to add to your ER diagram
- ER modeling and normalization should be used together—ER diagrams help identify entities and relationships, normalization ensures the resulting schema is well-designed
8.2.3.2 The Running Example
We’ll use our BridgeSensor mega-relation from earlier:
BridgeSensor(Bridge_ID, Bridge_Name, Year_Built, Length, Sensor_ID, Sensor_Type, Install_Date, Manufacturer, Mfr_Country)
With functional dependencies:
- \(F_1\): \(\text{Bridge\_ID} \rightarrow \text{Bridge\_Name, Year\_Built, Length}\)
- \(F_2\): \(\text{Sensor\_ID} \rightarrow \text{Sensor\_Type, Install\_Date, Bridge\_ID, Manufacturer}\)
- \(F_3\): \(\text{Manufacturer} \rightarrow \text{Mfr\_Country}\)
Key: \(\{\text{Sensor\_ID}\}\) (as we discovered via closure)
Sample data:
| Bridge_ID | Bridge_Name | Year_Built | Length | Sensor_ID | Sensor_Type | Install_Date | Manufacturer | Mfr_Country |
|---|---|---|---|---|---|---|---|---|
| B001 | Roberto Clemente Bridge | 1928 | 244 | S001 | accelerometer | 2020-01-15 | Acme Sensors | USA |
| B001 | Roberto Clemente Bridge | 1928 | 244 | S002 | strain_gauge | 2020-01-15 | Acme Sensors | USA |
| B001 | Roberto Clemente Bridge | 1928 | 244 | S003 | temperature | 2020-02-10 | TempTech | Germany |
| B002 | Fort Pitt Bridge | 1959 | 1201 | S004 | accelerometer | 2019-12-01 | Acme Sensors | USA |
We’ll progressively normalize this relation through 1NF, 2NF, 3NF, and finally BCNF.
8.2.3.3 First Normal Form (1NF)
Definition: A relation is in First Normal Form (1NF) if:
- All attribute values are atomic (indivisible)
- Each attribute contains only a single value from its domain (no sets, lists, or repeating groups)
- There is a primary key that uniquely identifies each row
What 1NF eliminates: Repeating groups and multi-valued attributes.
Simple Example of 1NF Violation:
Consider a poorly designed Employee relation:
| Employee_ID | Name | Phone_Numbers |
|---|---|---|
| E001 | Alice | 412-555-0100, 412-555-0101 |
| E002 | Bob | 412-555-0200 |
The Phone_Numbers column violates 1NF—it contains a list (multiple phone numbers in one cell).
To fix: Create separate rows for each phone number:
| Employee_ID | Name | Phone_Number |
|---|---|---|
| E001 | Alice | 412-555-0100 |
| E001 | Alice | 412-555-0101 |
| E002 | Bob | 412-555-0200 |
Now each cell contains a single atomic value.
Our BridgeSensor Example:
Our BridgeSensor relation is already in 1NF:
- All values are atomic (single values, not lists)
- Primary key exists: Sensor_ID
So we can move directly to analyzing higher normal forms. The real action begins with 2NF, where we address partial dependencies.
8.2.3.4 Second Normal Form (2NF)
Definition: A relation is in Second Normal Form (2NF) if:
- It is in 1NF, AND
- Every non-key attribute is fully functionally dependent on the entire primary key (no partial dependencies)
What is a partial dependency?
A partial dependency occurs when a non-key attribute depends on only part of a composite primary key, not the whole key.
If the primary key is a single attribute, partial dependencies are impossible—the relation is automatically in 2NF (assuming it’s in 1NF).
Modified BridgeSensor for 2NF Demonstration:
To demonstrate 2NF violations, let’s adjust our example slightly. Suppose sensor IDs are only unique within each bridge (Sensor 1 on Bridge A is different from Sensor 1 on Bridge B). Then:
Key: \(\{\text{Bridge\_ID, Sensor\_ID}\}\) (composite key)
BridgeSensor(Bridge_ID, Bridge_Name, Year_Built, Length, Sensor_ID, Sensor_Type, Install_Date, Manufacturer, Mfr_Country)
Functional dependencies:
- \(\text{Bridge\_ID} \rightarrow \text{Bridge\_Name, Year\_Built, Length}\) ← Partial! (depends only on part of key)
- \(\{\text{Bridge\_ID, Sensor\_ID}\} \rightarrow \text{Sensor\_Type, Install\_Date, Manufacturer}\)
- \(\text{Manufacturer} \rightarrow \text{Mfr\_Country}\)
Problem: Bridge_Name, Year_Built, and Length depend only on Bridge_ID, not on the full key {Bridge_ID, Sensor_ID}. This is a partial dependency, violating 2NF.
Anomalies from 2NF violation:
Looking at the data:
| Bridge_ID | Bridge_Name | Year_Built | Length | Sensor_ID | Sensor_Type | Install_Date | Manufacturer | Mfr_Country |
|---|---|---|---|---|---|---|---|---|
| B001 | Roberto Clemente Bridge | 1928 | 244 | 1 | accelerometer | 2020-01-15 | Acme Sensors | USA |
| B001 | Roberto Clemente Bridge | 1928 | 244 | 2 | strain_gauge | 2020-01-15 | Acme Sensors | USA |
| B001 | Roberto Clemente Bridge | 1928 | 244 | 3 | temperature | 2020-02-10 | TempTech | Germany |
Bridge information repeats for each sensor—classic redundancy from partial dependency.
Decomposition to 2NF:
Remove attributes that depend on only part of the key:
Bridge(Bridge_ID, Bridge_Name, Year_Built, Length)
- Primary key: Bridge_ID
- No partial dependencies (key is single attribute)
BridgeSensor(Bridge_ID, Sensor_ID, Sensor_Type, Install_Date, Manufacturer, Mfr_Country)
- Primary key: {Bridge_ID, Sensor_ID}
- All non-key attributes depend on the full key
Now the tables:
Bridge:
| Bridge_ID | Bridge_Name | Year_Built | Length |
|---|---|---|---|
| B001 | Roberto Clemente Bridge | 1928 | 244 |
| B002 | Fort Pitt Bridge | 1959 | 1201 |
BridgeSensor:
| Bridge_ID | Sensor_ID | Sensor_Type | Install_Date | Manufacturer | Mfr_Country |
|---|---|---|---|---|---|
| B001 | 1 | accelerometer | 2020-01-15 | Acme Sensors | USA |
| B001 | 2 | strain_gauge | 2020-01-15 | Acme Sensors | USA |
| B001 | 3 | temperature | 2020-02-10 | TempTech | Germany |
| B002 | 1 | accelerometer | 2019-12-01 | Acme Sensors | USA |
Benefits of 2NF:
- Eliminated redundancy: Bridge information appears only once
- No update anomalies: Changing a bridge’s length requires updating one row
- No deletion anomalies: Removing all sensors from a bridge doesn’t delete bridge information
- No insertion anomalies: Can add a bridge without sensors
Note: Relations with single-attribute primary keys (like Bridge above) are automatically in 2NF if they’re in 1NF, since partial dependencies are impossible.
Next problem: BridgeSensor is now in 2NF, but still has issues. Notice Manufacturer appears multiple times, and Mfr_Country is redundant. This is a transitive dependency, which 3NF addresses.
8.2.3.5 Third Normal Form (3NF)
Definition: A relation is in Third Normal Form (3NF) if:
- It is in 2NF, AND
- No non-key attribute is transitively dependent on the primary key
What is a transitive dependency?
A transitive dependency occurs when: - \(\text{Key} \rightarrow A \rightarrow B\)
In other words, a non-key attribute \(A\) determines another non-key attribute \(B\). The key determines \(B\) indirectly through \(A\) (by transitivity).
Example from BridgeSensor (currently in 2NF):
BridgeSensor(Bridge_ID, Sensor_ID, Sensor_Type, Install_Date, Manufacturer, Mfr_Country)
Functional dependencies:
- \(\{\text{Bridge\_ID, Sensor\_ID}\} \rightarrow \text{Manufacturer}\) (key determines Manufacturer)
- \(\text{Manufacturer} \rightarrow \text{Mfr\_Country}\) (non-key determines non-key)
By transitivity: \(\{\text{Bridge\_ID, Sensor\_ID}\} \rightarrow \text{Manufacturer} \rightarrow \text{Mfr\_Country}\)
This is a transitive dependency—the key determines Mfr_Country indirectly through Manufacturer. Violates 3NF!
Problem with transitive dependencies:
Look at the data:
| Bridge_ID | Sensor_ID | Sensor_Type | Install_Date | Manufacturer | Mfr_Country |
|---|---|---|---|---|---|
| B001 | 1 | accelerometer | 2020-01-15 | Acme Sensors | USA |
| B001 | 2 | strain_gauge | 2020-01-15 | Acme Sensors | USA |
| B001 | 3 | temperature | 2020-02-10 | TempTech | Germany |
| B002 | 1 | accelerometer | 2019-12-01 | Acme Sensors | USA |
Notice “Acme Sensors → USA” appears three times. If Acme Sensors moved to Canada:
- We’d need to update multiple rows (update anomaly)
- If we update inconsistently, we have contradictory data
Decomposition to 3NF:
Identify the transitive dependency and split it out:
- Identify determinants: Manufacturer is a determinant (determines Mfr_Country)
- Create new relation: Make a relation with Manufacturer as the key
- Remove dependent attributes: Remove Mfr_Country from BridgeSensor
Sensor(Bridge_ID, Sensor_ID, Sensor_Type, Install_Date, Manufacturer)
- Primary key: {Bridge_ID, Sensor_ID}
- Foreign key: Manufacturer references Manufacturer table
- No transitive dependencies!
Manufacturer(Manufacturer, Mfr_Country)
- Primary key: Manufacturer
- Simple relation, no transitive dependencies
Now the tables:
Sensor:
| Bridge_ID | Sensor_ID | Sensor_Type | Install_Date | Manufacturer |
|---|---|---|---|---|
| B001 | 1 | accelerometer | 2020-01-15 | Acme Sensors |
| B001 | 2 | strain_gauge | 2020-01-15 | Acme Sensors |
| B001 | 3 | temperature | 2020-02-10 | TempTech |
| B002 | 1 | accelerometer | 2019-12-01 | Acme Sensors |
Manufacturer:
| Manufacturer | Mfr_Country |
|---|---|
| Acme Sensors | USA |
| TempTech | Germany |
Benefits of 3NF:
- No redundancy from transitive dependencies: Each manufacturer’s country appears once
- No update anomalies: Changing Acme Sensors’ country requires one update
- No deletion anomalies: Deleting all sensors from a manufacturer doesn’t delete manufacturer info
- Better data integrity: Can’t have inconsistent countries for the same manufacturer
Summary so far:
We now have three relations, all in 3NF:
- Bridge(Bridge_ID, Bridge_Name, Year_Built, Length)
- Sensor(Bridge_ID, Sensor_ID, Sensor_Type, Install_Date, Manufacturer)
- Manufacturer(Manufacturer, Mfr_Country)
Are we done? For most practical purposes, yes—3NF is often good enough. But there’s one more normal form worth knowing: BCNF, which handles a subtle edge case.
8.2.3.6 Boyce-Codd Normal Form (BCNF)
Definition: A relation is in Boyce-Codd Normal Form (BCNF) if:
For every nontrivial functional dependency \(\bar{A} \rightarrow \bar{B}\), \(\bar{A}\) is a superkey (contains a candidate key).
In simpler terms: Every determinant must be a candidate key.
BCNF is a stricter version of 3NF. While 3NF says “no non-key attribute determines another non-key attribute,” BCNF says “ONLY keys can be determinants.”
Relationship to 3NF:
- Every relation in BCNF is also in 3NF
- Most relations in 3NF are also in BCNF
- BCNF violations occur in special cases, typically when there are overlapping candidate keys
When does BCNF differ from 3NF?
BCNF catches cases where a non-key attribute determines a key attribute (or part of a key). This doesn’t violate 3NF (which only restricts non-key → non-key dependencies), but it still causes anomalies.
Example: Bridge Inspection Schedule
Consider a relation tracking bridge inspections:
Inspection(Inspector, Bridge, Bridge_Type, Inspection_Date)
Business rules:
- Each inspector specializes in exactly one bridge type (arch, suspension, truss)
- Each bridge has exactly one type
- Each inspector inspects each bridge at most once
Functional dependencies:
- \(\text{Inspector} \rightarrow \text{Bridge\_Type}\) (each inspector specializes in one type)
- \(\text{Bridge} \rightarrow \text{Bridge\_Type}\) (each bridge has one type)
- \(\{\text{Inspector, Bridge}\} \rightarrow \text{Inspection\_Date}\) (key determines date)
Keys: \(\{\text{Inspector, Bridge}\}\) is the only candidate key
Sample data:
| Inspector | Bridge | Bridge_Type | Inspection_Date |
|---|---|---|---|
| Alice | B001 | suspension | 2024-01-15 |
| Alice | B002 | suspension | 2024-02-10 |
| Bob | B003 | arch | 2024-01-20 |
| Bob | B004 | arch | 2024-03-05 |
| Carol | B005 | truss | 2024-02-01 |
Is this in 3NF?
Check for transitive dependencies (non-key → non-key):
- Bridge_Type is determined by Inspector (both are part of different candidate keys or non-key)
- But wait—Inspector and Bridge are both part of the primary key!
- No non-key attribute determines another non-key attribute
Yes, this is in 3NF.
Is this in BCNF?
Check if every determinant is a superkey:
- \(\text{Inspector} \rightarrow \text{Bridge\_Type}\), but Inspector alone is NOT a superkey
- Inspector is a determinant but not a candidate key!
No, this violates BCNF.
Problem—Anomalies despite being in 3NF:
Look at the data again:
| Inspector | Bridge | Bridge_Type | Inspection_Date |
|---|---|---|---|
| Alice | B001 | suspension | 2024-01-15 |
| Alice | B002 | suspension | 2024-02-10 |
Bridge_Type “suspension” is repeated for every bridge Alice inspects.
- Update anomaly: If Alice switches specializations from suspension to arch, we must update every row where she appears
- Insertion anomaly: We can’t record that a new inspector specializes in truss bridges until they inspect a bridge
- Deletion anomaly: If Bob stops inspecting B003 and B004, we lose the information that Bob specializes in arch bridges
Decomposition to BCNF:
Split based on the problematic FD \(\text{Inspector} \rightarrow \text{Bridge\_Type}\):
InspectorSpecialty(Inspector, Bridge_Type)
- Primary key: Inspector
- FD: \(\text{Inspector} \rightarrow \text{Bridge\_Type}\) ✓ (determinant IS a key)
BridgeInspection(Inspector, Bridge, Inspection_Date)
- Primary key: {Inspector, Bridge}
- FD: \(\{\text{Inspector, Bridge}\} \rightarrow \text{Inspection\_Date}\) ✓ (determinant IS a key)
Now the tables:
InspectorSpecialty:
| Inspector | Bridge_Type |
|---|---|
| Alice | suspension |
| Bob | arch |
| Carol | truss |
BridgeInspection:
| Inspector | Bridge | Inspection_Date |
|---|---|---|
| Alice | B001 | 2024-01-15 |
| Alice | B002 | 2024-02-10 |
| Bob | B003 | 2024-01-20 |
| Bob | B004 | 2024-03-05 |
| Carol | B005 | 2024-02-01 |
Both relations are now in BCNF. Every determinant is a candidate key.
Benefits of BCNF:
- No redundancy: Each inspector’s specialty appears once
- No anomalies: All the update/insertion/deletion anomalies are eliminated
- Stronger guarantee: BCNF is the “gold standard” for functional dependency-based normal forms
Note: We also need the FD \(\text{Bridge} \rightarrow \text{Bridge\_Type}\) to be enforced. We’d need:
BridgeType(Bridge, Bridge_Type)
Making our final BCNF schema:
- InspectorSpecialty(Inspector, Bridge_Type)
- BridgeInspection(Inspector, Bridge, Inspection_Date)
- BridgeType(Bridge, Bridge_Type)
When to stop at 3NF vs. push to BCNF:
BCNF is the theoretical ideal: No anomalies from functional dependencies
3NF is sometimes preferred when achieving BCNF would:
- Require too many joins for common queries (performance cost)
- Lose dependency preservation (can’t enforce all original FDs with just the decomposed tables)
For most applications, BCNF is the target. But there are legitimate cases where 3NF is good enough.
8.2.3.7 Higher Normal Forms
BCNF handles all anomalies arising from functional dependencies. But there are other types of dependencies that cause anomalies, leading to even higher normal forms.
Fourth Normal Form (4NF):
4NF addresses multivalued dependencies (MVDs)—situations where one attribute determines a set of values for another attribute, independently of other attributes.
Example: Suppose an instructor teaches multiple courses and has multiple phone numbers, independently:
| Instructor | Course | Phone |
|---|---|---|
| Dr. Smith | DB Design | 555-0100 |
| Dr. Smith | DB Design | 555-0101 |
| Dr. Smith | Data Mining | 555-0100 |
| Dr. Smith | Data Mining | 555-0101 |
Every combination appears, creating redundancy. The multivalued dependency is:
\[\text{Instructor} \twoheadrightarrow \text{Course}\] \[\text{Instructor} \twoheadrightarrow \text{Phone}\]
To achieve 4NF, split into two relations:
- InstructorCourse(Instructor, Course)
- InstructorPhone(Instructor, Phone)
Fifth Normal Form (5NF) and Beyond:
5NF (also called Projection-Join Normal Form) deals with join dependencies—complex cases where information can only be reconstructed by joining three or more tables.
Higher normal forms (6NF, Domain-Key Normal Form) exist but are primarily of theoretical interest. They’re rarely used in practice because:
- BCNF eliminates virtually all practical anomalies in most databases
- Higher normal forms can require excessive joins, hurting query performance
- The additional complexity rarely justifies the marginal benefit
Practical Advice:
Don’t assume the highest level of normalization is always the most desirable.
BCNF is the sweet spot for most applications—strong guarantees without excessive complexity
Higher normal forms require more joins, which can slow query performance
Denormalization (intentionally violating normal forms) is sometimes appropriate for:
- Read-heavy applications where query speed matters more than update efficiency
- Data warehouses optimized for analytics
- Caching frequently accessed aggregated data
The key is understanding the tradeoffs and making informed design decisions based on your specific requirements.