6 Module 5: The Relational Model
6.1 Module Overview
6.1.1 Learning Objectives
By the end of this module, students will be able to:
- Understand the motivation behind the development of the relational model, and be able to provide a definition of it in your own words
- Describe the high-level connection between set theory (from Module 4) and the relational model
- Identify and relate the following concepts: relations, attributes, tuples and types; as well as connect them to their corresponding analogous terms: tables, columns, rows and domain
- Describe the meaning of key attributes and understand the difference between a schema and an instance of a database
- Understand the steps of relational database creation: design (using a DDL), initial data load, and query execution
- Identify the function of a query and how natural language queries for relational databases can be formalized with relational algebra
- Understand the link between relational algebra and SQL, as well as the fact that queries return relations
- Apply and reason through the application of the following relational algebra operators and their compositions: select, project, cross product, natural join, theta join, union, difference, intersection, rename
- Translate E/R diagrams to relational models of databases
6.1.2 Topics Covered
- Introduction to the Relational Model
- Historical context and motivation
- Definition of the relational model
- Connection to set theory from Module 4
- Relational Model Components
- Relations, attributes, tuples, types
- Analogous terms: tables, columns, rows, domains
- Key attributes and constraints
- Schema vs. instance
- Relational Database Creation Process
- Database design using DDL (Data Definition Language)
- Initial data loading
- Query execution
- Queries and Query Languages
- Purpose and function of queries
- Formalizing natural language queries with relational algebra
- Link between relational algebra and SQL
- Query results as relations
- Relational Algebra Operators
- Select (\(\sigma\))
- Project (\(\pi\))
- Cross Product (\(\times\))
- Natural Join (\(\bowtie\))
- Theta Join (\(\bowtie_{\theta}\))
- Union (\(\cup\))
- Difference (\(-\))
- Intersection (\(\cap\))
- Rename (\(\rho\))
- Composing operators for complex queries
- From ER Diagrams to Relational Models
- Converting entity sets to relations
- Translating relationships (1:1, 1:n, n:m)
- Handling weak entity sets
- Representing ISA hierarchies
- Complete example: Bridge monitoring system from Module 4
6.1.3 Project Milestones
Meet with your teammates to discuss final project ideas and select a winning one.
6.2 Lecture Notes
6.2.1 From ER Diagrams to Actual Databases
In Module 4, we learned two fundamental concepts: Entity Relationship (ER) diagrams for visualizing database structure, and set theory for understanding the mathematical foundations of data. ER diagrams helped us answer questions like “What entities exist in our system?” and “How are they related?” Set theory gave us the mathematical language to describe collections of data and operations on them.
But ER diagrams are design tools—they help us think through and communicate what our database should look like. They don’t actually store data or answer queries. To build a working database system, we need to translate our ER diagram design into a concrete implementation. This is where the relational model comes in.
The relational model provides the foundation for nearly all modern database systems. It gives us:
- A precise way to represent data (as relations/tables)
- A formal language to query data (relational algebra)
- Rules for translating ER diagrams into actual database structures
An important insight: Any database design—and the requirements for it—ultimately comes from envisioned use cases. In other words, understanding how users will interact with your database (what questions they’ll ask, what queries they’ll run) is the most important starting point for design. You might have a perfect ER diagram, but if it doesn’t support the queries your users need to run efficiently, it’s not serving its purpose.
As we progress through this module, keep in mind this design principle: the structure of your database should be driven by the queries you anticipate running. The relational model gives us both the structure (how to organize data in tables) and the query language (relational algebra) to make this happen.
6.2.2 Introduction to the Relational Model
The relational model is used by almost all major commercial database systems today—from PostgreSQL and MySQL to Oracle and Microsoft SQL Server. Despite being over 40 years old, it remains the dominant approach for organizing and querying data.
At its core, a Database Management System (DBMS) provides efficient, reliable, convenient, and safe multi-user storage of and access to massive amounts of persistent data. The relational model is the theoretical foundation that makes this possible, combined with high-level query languages that are both simple to use and expressive enough to handle complex data needs.
6.2.2.1 What is the Relational Model?
The relational model organizes data using a few simple but powerful concepts:
- Database = a set of named relations (also called tables)
- Each relation has a set of named attributes (also called columns)
- Each tuple (also called row) contains a value for each attribute
- Each attribute has a type (also called domain) that constrains what values it can hold
Additionally:
- A key is an attribute (or set of attributes) whose value uniquely identifies each tuple in the relation
- NULL is a special value representing “unknown” or “undefined”
Two critical concepts distinguish the structure from the data itself:
- Schema = the structural description of relations in the database (the set of relations, their attributes, types, and constraints)
- Instance = the actual contents of the database at a given point in time (the actual data)
The schema is like a blueprint—it describes what the database looks like. The instance is the current state—the actual rows of data stored at this moment.
6.2.2.2 Historical Context and Motivation
The relational model was introduced by Edgar F. Codd in 1970 in his seminal paper “A Relational Model of Data for Large Shared Data Banks.” Before Codd’s work, databases were navigational—you had to write code that explicitly navigated through data structures using pointers and links, similar to traversing linked lists or trees in programming.
I actually had not gone to the paper itself before, but in confirming what I got in these notes from Claude, I went to the source (so that I could include the link to the URL) today. The foresight of this 55 year old paper is quite remarkable. With the relational algebra we went over today, you might be able to understand the whole paper.
Codd’s revolutionary insight was that data could be organized as simple tables (relations) and queried using high-level, declarative languages based on mathematical logic. Instead of telling the computer how to find data (navigate this pointer, follow that link), you simply describe what you want, and the DBMS figures out how to retrieve it efficiently.
This abstraction dramatically simplified database programming and made databases accessible to non-programmers. It’s why SQL (Structured Query Language), which implements Codd’s relational algebra, became ubiquitous.
6.2.2.3 Connection to Set Theory
The relational model is fundamentally grounded in set theory, which we studied in Module 4. This connection isn’t coincidental—Codd built the relational model on the mathematical foundation of sets.
Here’s how the concepts map:
- A relation is a set of tuples (no duplicate rows, unordered)
- Relational algebra operations (select, project, join, union, etc.) are set operations
- The Cartesian product from set theory becomes the cross product in relational algebra
- Union, intersection, and difference work exactly as they do in set theory
This mathematical foundation gives the relational model:
- Precision: Operations are well-defined mathematically
- Compositionality: Query results are relations, so you can nest queries
- Optimization: Mathematical properties allow DBMSs to rewrite queries for efficiency
Throughout this module, you’ll see how the set operations you learned in Module 4 translate directly into relational algebra operators for querying databases.
6.2.3 Relational Model Components
Now let’s examine the core components of the relational model in detail. We’ll use examples from the bridge monitoring system we designed in Module 4 to make these concepts concrete.
6.2.3.1 Relations, Tuples, and Attributes
At the mathematical core, a relation is a set of tuples that share the same structure. Each tuple is an ordered collection of values, and each value corresponds to an attribute of the relation.
Formal definition: If we have attributes \(A_1, A_2, \ldots, A_n\) with corresponding domains (sets of allowed values) \(D_1, D_2, \ldots, D_n\), then a relation \(R\) is a subset of the Cartesian product:
\[R \subseteq D_1 \times D_2 \times \cdots \times D_n\]
In simpler terms: a relation is a set of n-tuples, where each tuple has the same structure—values for the same set of attributes.
Example: Consider a Bridge relation with attributes (Bridge_ID, Name, Year_Built, Length). A tuple in this relation might be:
\[(B001, \text{"Roberto Clemente Bridge"}, 1928, 244)\]
This tuple represents one bridge in our monitoring system.
Key properties of relations:
- Tuples are unordered: There’s no “first” or “last” tuple (they’re elements of a set)
- No duplicate tuples: Each tuple in a relation must be unique (set property)
- Attributes are atomic: Each attribute value is indivisible (no nested structures)
6.2.3.2 Tables, Rows, and Columns: The Common Terminology
While “relation,” “tuple,” and “attribute” are the formal mathematical terms, database practitioners use more intuitive terminology:
| Mathematical Term | Database Term | What It Means |
|---|---|---|
| Relation | Table | A collection of related data |
| Tuple | Row | A single record in the table |
| Attribute | Column | A property or field of the records |
Example: Here’s a Bridge table visualized:
| Bridge_ID | Name | Year_Built | Length |
|---|---|---|---|
| B001 | Roberto Clemente Bridge | 1928 | 244 |
| B002 | Fort Pitt Bridge | 1959 | 1201 |
| B003 | Fort Duquesne Bridge | 1969 | 1201 |
In this table:
- Each row is a tuple representing one bridge
- Each column is an attribute (property) of bridges
- The entire table is the Bridge relation
Throughout this course, we’ll use both sets of terms interchangeably, but remember they refer to the same concepts.
6.2.3.3 Domains and Types
Every attribute has a domain (or type)—the set of all possible values that attribute can take.
Examples of domains:
Bridge_ID: String of format “B” followed by 3 digits (e.g., “B001”, “B002”)Year_Built: Integer between 1800 and current yearLength: Positive real number (meters)Sensor_Type: Enumeration from {“accelerometer”, “strain_gauge”, “temperature”, “displacement”}
Domains serve several purposes:
- Data integrity: Prevent nonsensical values (e.g., negative lengths, future construction years)
- Storage optimization: DBMS can use appropriate storage formats (integers vs. strings vs. floats)
- Query validation: Catch errors at query time (e.g., comparing incompatible types)
In practice, database systems provide built-in types (INTEGER, VARCHAR, DATE, FLOAT, etc.) and allow you to define constraints that further restrict domains.
6.2.3.4 Keys and Constraints
A key is an attribute (or set of attributes) that uniquely identifies each tuple in a relation. No two tuples can have the same values for all key attributes.
Example: In our Bridge relation, Bridge_ID is a key—each bridge has a unique ID:
| Bridge_ID (KEY) | Name | Year_Built | Length |
|---|---|---|---|
| B001 | Roberto Clemente Bridge | 1928 | 244 |
| B002 | Fort Pitt Bridge | 1959 | 1201 |
| B003 | Fort Duquesne Bridge | 1969 | 1201 |
Notice that two bridges can have the same Length or even the same Name, but they must have different Bridge_ID values.
6.2.3.5 Schema vs. Instance
These two concepts are fundamental to understanding databases:
Schema = The structure or blueprint of the database
The schema defines:
- What relations exist
- What attributes each relation has
- The types/domains of each attribute
- What keys and constraints apply
The schema is relatively stable—it changes only when you redesign the database structure.
Instance = The actual data in the database at a specific moment
The instance contains:
- The actual tuples (rows) in each relation
- The specific values stored
The instance changes frequently—every time you insert, update, or delete data.
Example:
Bridge Schema:
Bridge(Bridge_ID: STRING, Name: STRING, Year_Built: INTEGER, Length: REAL)
PRIMARY KEY: Bridge_ID
CONSTRAINT: Year_Built > 1800
Bridge Instance (October 2024):
| Bridge_ID | Name | Year_Built | Length |
|---|---|---|---|
| B001 | Roberto Clemente Bridge | 1928 | 244 |
| B002 | Fort Pitt Bridge | 1959 | 1201 |
| B003 | Fort Duquesne Bridge | 1969 | 1201 |
Bridge Instance (November 2024, after adding a new bridge):
| Bridge_ID | Name | Year_Built | Length |
|---|---|---|---|
| B001 | Roberto Clemente Bridge | 1928 | 244 |
| B002 | Fort Pitt Bridge | 1959 | 1201 |
| B003 | Fort Duquesne Bridge | 1969 | 1201 |
| B004 | Smithfield Street Bridge | 1883 | 361 |
The schema remained the same—only the instance changed.
Key insight: When we design databases (including when we translate ER diagrams to relational schemas), we’re defining the schema. When users interact with databases (querying, inserting, updating data), they’re working with the instance.
6.2.4 Relational Database Creation Process
Once you’ve designed your database schema (often starting from an ER diagram), you need to actually create and populate the database. This process typically involves three main steps: defining the structure using DDL, loading initial data, and executing queries. Let’s examine each step.
6.2.4.1 Database Design with DDL
DDL (Data Definition Language) is the subset of database commands used to define and modify the structure of your database—creating tables, specifying attributes, defining keys and constraints.
Here’s an example of creating our Bridge table using SQL DDL syntax (note: while we’re focusing on relational algebra in this module, seeing a bit of DDL helps understand the connection between abstract schemas and concrete databases):
CREATE TABLE Bridge (
Bridge_ID VARCHAR(10) PRIMARY KEY,
Name VARCHAR(100) NOT NULL,
Year_Built INTEGER CHECK (Year_Built > 1800)
);
This DDL statement:
- Creates a table named
Bridge - Defines four attributes with their types
- Specifies
Bridge_IDas the primary key - Adds NOT NULL constraint (Name must have a value)
Some Key DDL operations:
CREATE TABLE: Define a new relationALTER TABLE: Modify an existing table structureDROP TABLE: Delete a table
6.2.4.2 Data Loading
After defining the schema, you need to populate the database with actual data. This is called data loading or data insertion.
Example: Loading bridges into our system:
INSERT INTO Bridge (Bridge_ID, Name, Year_Built, Length)
VALUES ('B001', 'Roberto Clemente Bridge', 1928, 244);
INSERT INTO Bridge (Bridge_ID, Name, Year_Built, Length)
VALUES ('B002', 'Fort Pitt Bridge', 1959, 1201);
INSERT INTO Bridge (Bridge_ID, Name, Year_Built, Length)
VALUES ('B003', 'Fort Duquesne Bridge', 1969, 1201);
For large datasets, DBMSs typically provide bulk loading utilities that can read data from CSV files or other formats:
COPY Bridge FROM '/path/to/bridges.csv'
WITH (FORMAT CSV, HEADER TRUE);
Data loading considerations:
- Constraint validation: The DBMS checks all constraints during insertion (keys, but also other constraints we haven’t seen yet such as foreign keys, CHECK constraints)
- Transaction handling: Large data loads are often wrapped in transactions for consistency
- Performance: Bulk loading is much faster than individual inserts
6.2.4.3 Query Execution
Once your database has structure (schema) and data (instance), users can query it—asking questions and retrieving information.
Example natural language query: “Find all bridges built before 1950”
This would be expressed in SQL as:
SELECT * FROM Bridge
WHERE Year_Built < 1950;
And would return:
| Bridge_ID | Name | Year_Built | Length |
|---|---|---|---|
| B001 | Roberto Clemente Bridge | 1928 | 244 |
| B004 | Smithfield Street Bridge | 1883 | 361 |
What happens during query execution?
- Parsing: The DBMS parses the query to check syntax
- Validation: Checks that tables and columns exist, types match, etc.
- Optimization: The DBMS creates an execution plan (how to actually retrieve the data efficiently)
- Execution: The plan is executed, accessing the stored data
- Result: The result is returned as a new relation (table)
Key insight: The relational model’s power comes from declarative queries—you specify what you want, not how to get it. The DBMS handles the “how” through query optimization. This is fundamentally different from navigational databases where you had to specify the exact path through data structures.
In the next sections, we’ll learn relational algebra—the formal mathematical language that underlies these queries and allows us to reason about what queries compute.
6.2.5 Queries and Query Languages
Now that we understand how databases are structured and created, let’s focus on how we actually use them—by asking questions and retrieving information through queries.
A query is a request for information from a database. Some queries are easy to pose (“Find all bridges”), while others are more complex (“Find all sensors on bridges built before 1950 that have recorded readings above threshold in the past month”). Similarly, some queries are easy for a DBMS to execute efficiently, while others require significant computation.
An important note: the term “query language” is somewhat misleading—these languages are also used to modify data (inserting, updating, deleting), not just retrieve it. But historically, they’re called query languages because retrieval was their primary purpose.
A fundamental property: In the relational model, queries return relations. This property is called closure or compositionality—the result of a query is itself a relation, which means you can use it as input to another query. This enables building complex queries from simpler ones.
6.2.5.1 What is a Query?
At the most basic level, a query expresses an information need:
- Natural language: “Show me all bridges built before 1950”
- Formal language (relational algebra): \(\sigma_{\text{Year\_Built} < 1950}(\text{Bridge})\)
- SQL:
SELECT * FROM Bridge WHERE Year_Built < 1950
All three express the same information need, but with different levels of precision and formality.
Queries typically involve:
- Selection: Choosing which rows satisfy certain conditions
- Projection: Choosing which columns to include in results
- Combination: Joining data from multiple tables
- Aggregation: Computing summary statistics (count, average, etc.)
The power of declarative query languages is that you specify what you want, not how to get it. The DBMS figures out an efficient execution strategy.
6.2.5.2 Relational Algebra as a Query Language
Relational algebra is the formal mathematical language for expressing queries on relational databases. It consists of a set of operators that take one or more relations as input and produce a new relation as output.
Relational algebra serves several important purposes:
- Formal foundation: Provides a precise mathematical definition of what queries compute
- Query optimization: DBMSs use relational algebra internally to reason about and optimize queries
- Conceptual tool: Helps us understand what operations are possible and how they compose
The core relational algebra operators are:
- Select (\(\sigma\)): Picks certain rows based on a condition
- Project (\(\pi\)): Picks certain columns
- Cross Product (\(\times\)): Combines all rows from two tables
- Natural Join (\(\bowtie\)): Combines rows from two tables based on common attributes
- Union (\(\cup\)): Combines rows from two tables (with same schema)
- Difference (\(-\)): Rows in first table but not in second
- Intersection (\(\cap\)): Rows in both tables
- Rename (\(\rho\)): Changes relation or attribute names
Notice how these connect to the set operations from Module 4! Union, difference, and intersection work exactly as they did with sets. The relational-specific operators (select, project, join) extend set theory for working with structured data.
We’ll explore each operator in detail in the next section.
6.2.5.3 The Link to SQL
SQL (Structured Query Language) is the practical implementation of relational algebra used by virtually all commercial database systems. While relational algebra is the mathematical foundation, SQL is what you actually write when working with databases.
Correspondence between relational algebra and SQL:
| Relational Algebra | SQL Equivalent |
|---|---|
| \(\sigma_{\text{condition}}(R)\) | SELECT * FROM R WHERE condition |
| \(\pi_{\text{A,B}}(R)\) | SELECT A, B FROM R |
| \(R \times S\) | SELECT * FROM R, S (or R CROSS JOIN S) |
| \(R \bowtie S\) | SELECT * FROM R NATURAL JOIN S |
| \(R \cup S\) | SELECT * FROM R UNION SELECT * FROM S |
Key differences:
- Duplicates: Relational algebra is based on sets (no duplicates). SQL is based on multisets/bags (allows duplicates unless you use
DISTINCT) - NULL values: SQL includes NULL for unknown/undefined values; pure relational algebra doesn’t
- Syntax: SQL is more verbose but arguably more readable for complex queries
- Additional features: SQL includes many features beyond basic relational algebra (aggregation, grouping, subqueries, etc.)
In this module, we’re focusing on relational algebra because it provides the clean mathematical foundation. Understanding relational algebra makes learning SQL straightforward—you’ll already understand what operations mean, and SQL syntax is just a different notation.
Example: Let’s see the same query in both notations.
Query: “Find the names of bridges built before 1950”
Relational Algebra: \[\pi_{\text{Name}}(\sigma_{\text{Year\_Built} < 1950}(\text{Bridge}))\]
SQL:
SELECT Name
FROM Bridge
WHERE Year_Built < 1950;
Both express: first filter bridges to those built before 1950 (select/WHERE), then extract just the Name column (project/SELECT Name).
6.2.6 Relational Algebra Operators
Now we’ll examine each relational algebra operator in detail. For each operator, we’ll provide a formal definition, explain what it does intuitively, and show examples using our bridge monitoring system.
Throughout this section, we’ll use these sample relations:
Bridge:
| Bridge_ID | Name | Year_Built | Length |
|---|---|---|---|
| B001 | Roberto Clemente Bridge | 1928 | 244 |
| B002 | Fort Pitt Bridge | 1959 | 1201 |
| B003 | Fort Duquesne Bridge | 1969 | 1201 |
| B004 | Smithfield Street Bridge | 1883 | 361 |
Sensor:
| Sensor_ID | Type | Bridge_ID | Install_Date |
|---|---|---|---|
| S001 | accelerometer | B001 | 2020 |
| S002 | strain_gauge | B001 | 2020 |
| S003 | temperature | B002 | 2019 |
| S004 | accelerometer | B003 | 2021 |
6.2.6.1 Select (\(\sigma\))
The select operator picks certain rows from a relation based on a condition. It filters the relation to include only tuples that satisfy the specified predicate.
Notation: \(\sigma_{\text{condition}}(R)\)
Definition: \(\sigma_{\text{condition}}(R) = \{t \mid t \in R \text{ and condition}(t) \text{ is true}\}\)
In words: Select returns all tuples \(t\) from relation \(R\) where the condition evaluates to true.
Example 1: Find all bridges built before 1950
\[\sigma_{\text{Year\_Built} < 1950}(\text{Bridge})\]
Result:
| Bridge_ID | Name | Year_Built | Length |
|---|---|---|---|
| B001 | Roberto Clemente Bridge | 1928 | 244 |
| B004 | Smithfield Street Bridge | 1883 | 361 |
Example 2: Find all accelerometer sensors
\[\sigma_{\text{Type} = \text{'accelerometer'}}(\text{Sensor})\]
Result:
| Sensor_ID | Type | Bridge_ID | Install_Date |
|---|---|---|---|
| S001 | accelerometer | B001 | 2020 |
| S004 | accelerometer | B003 | 2021 |
Key points:
- Select operates on rows (horizontal filtering)
- The condition can involve comparisons (\(=\), \(<\), \(>\), \(\leq\), \(\geq\), \(\neq\)), logical operators (\(\land\), \(\lor\), \(\neg\)), and arithmetic
- The result schema is identical to the input schema—only rows are filtered
6.2.6.2 Project (\(\pi\))
The project operator picks certain columns from a relation. It extracts a subset of attributes while discarding the others.
Notation: \(\pi_{\text{A}_1, \text{A}_2, \ldots, \text{A}_k}(R)\)
Definition: \(\pi_{\text{A}_1, \ldots, \text{A}_k}(R) = \{(t.\text{A}_1, \ldots, t.\text{A}_k) \mid t \in R\}\)
In words: Project returns tuples containing only the specified attributes from \(R\).
Example 1: Get just the names and years of bridges
\[\pi_{\text{Name, Year\_Built}}(\text{Bridge})\]
Result:
| Name | Year_Built |
|---|---|
| Roberto Clemente Bridge | 1928 |
| Fort Pitt Bridge | 1959 |
| Fort Duquesne Bridge | 1969 |
| Smithfield Street Bridge | 1883 |
Example 2: Get sensor types
\[\pi_{\text{Type}}(\text{Sensor})\]
Result:
| Type |
|---|
| accelerometer |
| strain_gauge |
| temperature |
Key points:
- Project operates on columns (vertical filtering)
- Duplicates are always eliminated in relational algebra (since results are sets). Notice in Example 2 we only get one “accelerometer” even though there are two accelerometer sensors
- SQL behaves differently: it keeps duplicates unless you use
DISTINCT. In SQL,SELECT Type FROM Sensorwould return “accelerometer” twice, whileSELECT DISTINCT Type FROM Sensormatches the relational algebra behavior - The result schema contains only the projected attributes
6.2.6.3 Cross Product (\(\times\))
The cross product (also called Cartesian product) combines two relations by pairing each tuple from the first relation with every tuple from the second relation.
Notation: \(R \times S\)
Definition: \(R \times S = \{(r, s) \mid r \in R \text{ and } s \in S\}\)
In words: The cross product returns all possible pairs of tuples, one from \(R\) and one from \(S\).
Example: Consider smaller relations for clarity:
SmallBridge:
| Bridge_ID | Name |
|---|---|
| B001 | Roberto Clemente |
| B002 | Fort Pitt |
SmallSensor:
| Sensor_ID | Type |
|---|---|
| S001 | accelerometer |
| S002 | strain_gauge |
\[\text{SmallBridge} \times \text{SmallSensor}\]
Result:
| Bridge_ID | Name | Sensor_ID | Type |
|---|---|---|---|
| B001 | Roberto Clemente | S001 | accelerometer |
| B001 | Roberto Clemente | S002 | strain_gauge |
| B002 | Fort Pitt | S001 | accelerometer |
| B002 | Fort Pitt | S002 | strain_gauge |
Key points:
- If \(R\) has \(m\) tuples and \(S\) has \(n\) tuples, then \(R \times S\) has \(m \times n\) tuples
- Cross product by itself is rarely useful—it generates all possible combinations, including nonsensical ones (e.g., Bridge B001 paired with a sensor actually on Bridge B002)
- It becomes useful when combined with select to filter for meaningful pairings (this is essentially what a join does)
- The result schema includes all attributes from both relations
6.2.6.4 Natural Join (\(\bowtie\))
The natural join combines two relations by matching tuples that have equal values for all attributes with the same name. It’s like a cross product followed by filtering for equality on common attributes, then eliminating duplicate columns.
Notation: \(R \bowtie S\)
Definition: Natural join automatically:
- Finds all attributes that appear in both \(R\) and \(S\)
- Keeps only tuple pairs where these common attributes have equal values
- Eliminates one copy of the duplicate attributes
Example: Join Bridge and Sensor on their common attribute Bridge_ID
\[\text{Bridge} \bowtie \text{Sensor}\]
This implicitly enforces Bridge.Bridge_ID = Sensor.Bridge_ID and removes the duplicate Bridge_ID column.
Result:
| Bridge_ID | Name | Year_Built | Length | Sensor_ID | Type | Install_Date |
|---|---|---|---|---|---|---|
| B001 | Roberto Clemente Bridge | 1928 | 244 | S001 | accelerometer | 2020 |
| B001 | Roberto Clemente Bridge | 1928 | 244 | S002 | strain_gauge | 2020 |
| B002 | Fort Pitt Bridge | 1959 | 1201 | S003 | temperature | 2019 |
| B003 | Fort Duquesne Bridge | 1969 | 1201 | S004 | accelerometer | 2021 |
Notice Bridge B004 doesn’t appear (no sensors are installed on it).
Formal definition using other operators:
Natural join can be defined in terms of cross product, select, and project. If \(R\) and \(S\) share attributes \(A_1, \ldots, A_k\):
\[R \bowtie S = \pi_{\text{all attributes, removing duplicates}}(\sigma_{R.A_1 = S.A_1 \land \cdots \land R.A_k = S.A_k}(R \times S))\]
Key points:
- Automatically finds common attributes—no need to specify join condition
- Very convenient when relations are designed with consistent naming
- Eliminates one copy of join attributes (otherwise you’d have Bridge_ID appearing twice)
- Result includes only tuples with matching values—unmatched tuples are excluded
6.2.6.5 Theta Join (\(\bowtie_\theta\))
The theta join (or just “join”) is the most general form of join. It combines tuples from two relations based on an arbitrary condition \(\theta\).
Notation: \(R \bowtie_\theta S\)
Definition: \(R \bowtie_\theta S = \sigma_\theta(R \times S)\)
In words: Theta join is a cross product followed by a select with condition \(\theta\).
Example 1: Find pairs of bridges and sensors where the sensor was installed after the bridge was built
\[\text{Bridge} \bowtie_{\text{Year\_Built} < \text{Install\_Date}} \text{Sensor}\]
This keeps pairs where the bridge’s construction year is before the sensor’s installation year.
Result:
| Bridge_ID | Name | Year_Built | Length | Sensor_ID | Type | Bridge_ID | Install_Date |
|---|---|---|---|---|---|---|---|
| B001 | Roberto Clemente Bridge | 1928 | 244 | S001 | accelerometer | B001 | 2020 |
| B001 | Roberto Clemente Bridge | 1928 | 244 | S002 | strain_gauge | B001 | 2020 |
| B001 | Roberto Clemente Bridge | 1928 | 244 | S003 | temperature | B002 | 2019 |
| … | … | … | … | … | … | … | … |
(Note: This would include all combinations where Year_Built < Install_Date, which includes many nonsensical pairings)
Example 2: A more practical theta join—finding which sensors are on which bridges:
\[\text{Bridge} \bowtie_{\text{Bridge.Bridge\_ID} = \text{Sensor.Bridge\_ID}} \text{Sensor}\]
This is actually equivalent to natural join when the condition is equality on same-named attributes.
Key points:
- Theta join is the basic operation implemented in DBMSs—the term “join” in database systems often means theta join
- The condition \(\theta\) can be any boolean expression involving attributes from both relations
- Natural join is a special case of theta join (where \(\theta\) is equality on all common attributes)
- Unlike natural join, theta join does not eliminate duplicate columns
6.2.6.6 Union (\(\cup\))
The union operator combines tuples from two relations vertically, producing a single relation containing all tuples that appear in either input relation (or both).
Notation: \(R \cup S\)
Definition: \(R \cup S = \{t \mid t \in R \text{ or } t \in S\}\)
Requirement: \(R\) and \(S\) must be union-compatible—they must have the same number of attributes, and corresponding attributes must have compatible types. Formally, \(R\) and \(S\) must have the same schema.
Example: Suppose we have two separate monitoring systems with their own Bridge relations:
PittsburghBridges:
| Bridge_ID | Name |
|---|---|
| B001 | Roberto Clemente |
| B002 | Fort Pitt |
AlleghenyBridges:
| Bridge_ID | Name |
|---|---|
| B002 | Fort Pitt |
| B003 | Fort Duquesne |
\[\text{PittsburghBridges} \cup \text{AlleghenyBridges}\]
Result:
| Bridge_ID | Name |
|---|---|
| B001 | Roberto Clemente |
| B002 | Fort Pitt |
| B003 | Fort Duquesne |
Notice Fort Pitt (B002) appears only once—duplicates are eliminated (set property).
Key points:
- Union combines tuples vertically (adding rows)
- Both relations must have identical schemas
- Duplicates are eliminated (set-based operation)
- This is the same union operation from set theory in Module 4
6.2.6.7 Difference (\(-\))
The difference operator returns tuples that are in the first relation but not in the second relation.
Notation: \(R - S\)
Definition: \(R - S = \{t \mid t \in R \text{ and } t \notin S\}\)
Requirement: Like union, \(R\) and \(S\) must be union-compatible (same schema).
Example: Using the same bridge relations:
\[\text{PittsburghBridges} - \text{AlleghenyBridges}\]
Result:
| Bridge_ID | Name |
|---|---|
| B001 | Roberto Clemente |
Only Roberto Clemente appears because it’s in PittsburghBridges but not in AlleghenyBridges.
Example 2: Find bridges with no sensors
\[\pi_{\text{Bridge\_ID}}(\text{Bridge}) - \pi_{\text{Bridge\_ID}}(\text{Sensor})\]
This gets all bridge IDs from Bridge, then removes those that appear in Sensor, leaving only bridges without sensors.
Result:
| Bridge_ID |
|---|
| B004 |
Key points:
- Difference is not commutative: \(R - S \neq S - R\) (in general)
- Both relations must have identical schemas
- Useful for finding what’s in one relation but not another
- Same as set difference from Module 4
6.2.6.8 Intersection (\(\cap\))
The intersection operator returns tuples that appear in both input relations.
Notation: \(R \cap S\)
Definition: \(R \cap S = \{t \mid t \in R \text{ and } t \in S\}\)
Requirement: \(R\) and \(S\) must be union-compatible (same schema).
Example:
\[\text{PittsburghBridges} \cap \text{AlleghenyBridges}\]
Result:
| Bridge_ID | Name |
|---|---|
| B002 | Fort Pitt |
Only Fort Pitt appears because it’s in both relations.
Key points:
- Intersection finds common elements
- Both relations must have identical schemas
- Can be expressed using difference: \(R \cap S = R - (R - S)\)
- Same as set intersection from Module 4
6.2.6.9 Rename (\(\rho\))
The rename operator changes the name of a relation and/or its attributes. This is useful for disambiguation and for making schemas compatible for set operations.
Notation:
- \(\rho_{S}(R)\) — rename relation \(R\) to \(S\)
- \(\rho_{S(A_1, \ldots, A_n)}(R)\) — rename relation to \(S\) and attributes to \(A_1, \ldots, A_n\)
Use cases:
1. Unifying schemas for set operations
Suppose we have:
OldBridges(ID, BridgeName) and NewBridges(Identifier, Name)
They have different schemas, but we want to take their union. We can rename to make them compatible:
\[\rho_{\text{Bridges(Bridge\_ID, Name)}}(\text{OldBridges}) \cup \rho_{\text{Bridges(Bridge\_ID, Name)}}(\text{NewBridges})\]
2. Disambiguation in self-joins
A self-join is when a relation is joined with itself. We need to rename one copy to distinguish them.
Example: Find pairs of bridges built in the same year
\[\sigma_{\text{B1.Year\_Built} = \text{B2.Year\_Built} \land \text{B1.Bridge\_ID} < \text{B2.Bridge\_ID}}(\rho_{\text{B1}}(\text{Bridge}) \times \rho_{\text{B2}}(\text{Bridge}))\]
Here we rename Bridge to B1 and B2 so we can refer to them separately.
Alternate notation—Assignment statements:
Instead of deeply nested expressions, we can use assignment statements (a sequence of instructions):
B1 := Bridge
B2 := Bridge
SameYear := σ(B1.Year_Built = B2.Year_Built AND B1.Bridge_ID < B2.Bridge_ID)(B1 × B2)
Result := π(B1.Name, B2.Name, B1.Year_Built)(SameYear)
This is more readable for complex queries—each step is named and can be referenced later.
Key points:
- Rename is essential for self-joins (comparing a relation to itself)
- Useful for making schemas compatible for set operations
- Assignment notation provides an algorithmic way to express complex queries
6.2.6.10 Composing Operators for Complex Queries
The real power of relational algebra comes from composing operators—building complex queries by combining simpler operations. Because each operator returns a relation, the output of one operation can be the input to another.
Example: “Find the names of bridges that have accelerometer sensors”
We need to:
- Filter sensors to just accelerometers
- Join with bridges to get bridge information
- Project to get just bridge names
Step-by-step:
- \(\text{AccelSensors} := \sigma_{\text{Type} = \text{'accelerometer'}}(\text{Sensor})\)
- \(\text{BridgesWithAccel} := \text{Bridge} \bowtie \text{AccelSensors}\)
- \(\text{Result} := \pi_{\text{Name}}(\text{BridgesWithAccel})\)
As a single expression:
\[\pi_{\text{Name}}(\text{Bridge} \bowtie \sigma_{\text{Type} = \text{'accelerometer'}}(\text{Sensor}))\]
Result:
| Name |
|---|
| Roberto Clemente Bridge |
| Fort Duquesne Bridge |
Example 2: “Find bridges built before 1950 that are longer than 300 meters”
\[\sigma_{\text{Length} > 300}(\sigma_{\text{Year\_Built} < 1950}(\text{Bridge}))\]
Or equivalently, using a compound condition:
\[\sigma_{\text{Year\_Built} < 1950 \land \text{Length} > 300}(\text{Bridge})\]
Result:
| Bridge_ID | Name | Year_Built | Length |
|---|---|---|---|
| B004 | Smithfield Street Bridge | 1883 | 361 |
Key insight: Relational algebra is compositional—you can build arbitrarily complex queries by chaining operators together. This is the foundation for how DBMSs process queries: they break down complex SQL queries into sequences of relational algebra operations, then optimize the order and method of execution.
6.2.7 From ER Diagrams to Relational Models
Now we bring everything together: translating the ER diagrams from Module 4 into actual relational schemas that can be implemented in a database. This translation process is mechanical—there are specific rules for converting each ER diagram component into relational tables.
6.2.7.1 Translation Rules: Entity Sets to Relations
The most basic translation rule is straightforward: each entity set becomes a relation (table).
Rule: For each entity set \(E\) with attributes \(A_1, A_2, \ldots, A_n\):
- Create a relation \(E\) with attributes \(A_1, A_2, \ldots, A_n\)
- The key attribute(s) of the entity set become the primary key of the relation
Example: The Bridge entity set from Module 4:
ER Diagram: Bridge entity with attributes (Bridge_ID, Name, Year_Built, Material, Length), where Bridge_ID is the key.
Relational Schema:
Bridge(Bridge_ID, Name, Year_Built, Material, Length)
PRIMARY KEY: Bridge_ID
As a table:
| Bridge_ID | Name | Year_Built | Material | Length |
|---|---|---|---|---|
| B001 | Roberto Clemente Bridge | 1928 | steel | 244 |
| B002 | Fort Pitt Bridge | 1959 | steel | 1201 |
Example 2: The Sensor entity set:
ER Diagram: Sensor entity with attributes (Sensor_ID, Type, Install_Date), where Sensor_ID is the key.
Relational Schema:
Sensor(Sensor_ID, Type, Install_Date)
PRIMARY KEY: Sensor_ID
Key points:
- Each entity becomes a row in the relation
- Entity attributes become relation attributes
- Key attributes become primary keys
- Multi-valued attributes require special handling (we’ll see this shortly)
6.2.7.2 Handling Relationships
Relationships in ER diagrams connect entity sets. How we translate them depends on the multiplicity (1:1, 1:n, or n:m).
Notice that in class (and in the slides) I only idscussed the n:m case, recommending that we create a new table for the relationship always. These notes also consider the special cases of 1:1 and 1:n relationships. While you can still create a new table for those relationships, it may be more efficient to avoid that and instead use foreign keys (or merge the tables) in some situations as discussed by Claude here.
6.2.7.2.1 One-to-One Relationships
For a one-to-one (1:1) relationship between entity sets \(E\) and \(F\):
Option 1: Add a foreign key to one of the relations
- Choose either \(E\) or \(F\) (typically the one participating fully in the relationship)
- Add \(F\)’s primary key as a foreign key attribute in \(E\) (or vice versa)
Example: Suppose each Bridge has exactly one TaxRecord and each TaxRecord belongs to exactly one Bridge.
ER Diagram: Bridge (1) — Has_Tax_Record — (1) TaxRecord
Relational Schema:
Bridge(Bridge_ID, Name, Year_Built, Length)
PRIMARY KEY: Bridge_ID
TaxRecord(Tax_ID, Assessment_Value, Last_Updated, Bridge_ID)
PRIMARY KEY: Tax_ID
FOREIGN KEY: Bridge_ID REFERENCES Bridge(Bridge_ID)
We added Bridge_ID as a foreign key in TaxRecord.
Option 2: Merge into a single relation
- If the relationship is total (every entity participates), you could combine both entity sets into one table
- This is less common but can simplify queries
6.2.7.2.2 One-to-Many Relationships
For a one-to-many (1:n) relationship from \(E\) to \(F\) (one \(E\) relates to many \(F\)s):
Rule: Add the primary key of the “one” side as a foreign key in the “many” side.
Example: One Bridge is monitored by many Sensors (from Module 4).
ER Diagram: Bridge (1) — Monitored_By — (n) Sensor
Relational Schema:
Bridge(Bridge_ID, Name, Year_Built, Length)
PRIMARY KEY: Bridge_ID
Sensor(Sensor_ID, Type, Install_Date, Bridge_ID)
PRIMARY KEY: Sensor_ID
FOREIGN KEY: Bridge_ID REFERENCES Bridge(Bridge_ID)
Each sensor has a Bridge_ID foreign key indicating which bridge it’s on.
As tables:
Bridge:
| Bridge_ID | Name | Year_Built | Length |
|---|---|---|---|
| B001 | Roberto Clemente Bridge | 1928 | 244 |
| B002 | Fort Pitt Bridge | 1959 | 1201 |
Sensor:
| Sensor_ID | Type | Install_Date | Bridge_ID |
|---|---|---|---|
| S001 | accelerometer | 2020 | B001 |
| S002 | strain_gauge | 2020 | B001 |
| S003 | temperature | 2019 | B002 |
Notice multiple sensors can have the same Bridge_ID (many-to-one from Sensor’s perspective).
6.2.7.2.3 Many-to-Many Relationships
For a many-to-many (n:m) relationship between entity sets \(E\) and \(F\):
Rule: Create a new relation (called a junction table or linking table) with:
- Primary keys from both \(E\) and \(F\) as foreign keys
- The combination of these foreign keys forms the primary key
- Any attributes of the relationship itself
Example: Engineers inspect Bridges, and each engineer can inspect multiple bridges, while each bridge can be inspected by multiple engineers.
ER Diagram: Engineer (n) — Inspects — (m) Bridge
The relationship Inspects has attributes: Inspection_Date, Condition_Rating
Relational Schema:
Engineer(Employee_ID, Name, Specialty)
PRIMARY KEY: Employee_ID
Bridge(Bridge_ID, Name, Year_Built, Length)
PRIMARY KEY: Bridge_ID
Inspects(Employee_ID, Bridge_ID, Inspection_Date, Condition_Rating)
PRIMARY KEY: (Employee_ID, Bridge_ID, Inspection_Date)
FOREIGN KEY: Employee_ID REFERENCES Engineer(Employee_ID)
FOREIGN KEY: Bridge_ID REFERENCES Bridge(Bridge_ID)
As tables:
Inspects:
| Employee_ID | Bridge_ID | Inspection_Date | Condition_Rating |
|---|---|---|---|
| E001 | B001 | 2024-01-15 | Good |
| E001 | B002 | 2024-01-20 | Fair |
| E002 | B001 | 2024-02-10 | Good |
This captures that Engineer E001 inspected both B001 and B002, and Bridge B001 was inspected by both E001 and E002.
Key insight: Many-to-many relationships cannot be captured with just foreign keys in the entity tables—you must create a junction table.
6.2.7.3 Weak Entity Sets in Relational Models
Weak entity sets depend on another entity set for their identity. When translating to relations:
Rule:
- Create a relation for the weak entity set
- Include all its own attributes
- Include the primary key of the identifying (owner) entity set as a foreign key
- The primary key of the weak entity is a composite key: its own partial key plus the owner’s primary key
Example: Sensor Readings depend on Sensors (from Module 4).
ER Diagram: Sensor (strong) — Generates — (weak) Reading
Reading has partial key Sequence_Number (unique within each sensor, but not globally).
Relational Schema:
Sensor(Sensor_ID, Type, Bridge_ID)
PRIMARY KEY: Sensor_ID
Reading(Sensor_ID, Sequence_Number, Timestamp, Value)
PRIMARY KEY: (Sensor_ID, Sequence_Number)
FOREIGN KEY: Sensor_ID REFERENCES Sensor(Sensor_ID)
As a table:
Reading:
| Sensor_ID | Sequence_Number | Timestamp | Value |
|---|---|---|---|
| S001 | 1 | 2024-01-01 08:00 | 0.05 |
| S001 | 2 | 2024-01-01 08:10 | 0.06 |
| S002 | 1 | 2024-01-01 08:00 | 120.5 |
| S002 | 2 | 2024-01-01 08:10 | 121.0 |
Notice Sequence_Number alone doesn’t uniquely identify a reading—you need both Sensor_ID and Sequence_Number.
6.2.7.4 ISA Hierarchies and Inheritance
ISA hierarchies (subclasses) have multiple translation approaches. The choice depends on your query patterns and storage efficiency needs.
Approach 1: Single Table (ER-Style) - Create one table for the superclass with all possible attributes - Add a Type column to distinguish subclasses - Use NULL for attributes that don’t apply to a given subclass
Approach 2: Table Per Subclass - Create separate tables for each subclass - Each subclass table contains all attributes (inherited + specific) - No table for the superclass
Approach 3: Table Per Hierarchy (Hybrid) - Create one table for the superclass with shared attributes - Create separate tables for each subclass with subclass-specific attributes only - Subclass tables reference the superclass via foreign key
Example: Bridge has subclasses Suspension_Bridge, Arch_Bridge, Truss_Bridge (from Module 4).
ER Diagram:
- Bridge (superclass) with attributes (Bridge_ID, Name, Year_Built, Length)
- Suspension_Bridge with specific attributes (Cable_Diameter, Tower_Height)
- Arch_Bridge with specific attributes (Arch_Span, Rise)
Approach 1 - Single Table:
Bridge(Bridge_ID, Name, Year_Built, Length, Bridge_Type,
Cable_Diameter, Tower_Height, Arch_Span, Rise)
PRIMARY KEY: Bridge_ID
| Bridge_ID | Name | Year_Built | Length | Bridge_Type | Cable_Diameter | Tower_Height | Arch_Span | Rise |
|---|---|---|---|---|---|---|---|---|
| B001 | Roberto Clemente | 1928 | 244 | suspension | 0.5 | 45 | NULL | NULL |
| B002 | Fort Pitt | 1959 | 1201 | arch | NULL | NULL | 300 | 60 |
Approach 3 - Table Per Hierarchy:
Bridge(Bridge_ID, Name, Year_Built, Length)
PRIMARY KEY: Bridge_ID
Suspension_Bridge(Bridge_ID, Cable_Diameter, Tower_Height)
PRIMARY KEY: Bridge_ID
FOREIGN KEY: Bridge_ID REFERENCES Bridge(Bridge_ID)
Arch_Bridge(Bridge_ID, Arch_Span, Rise)
PRIMARY KEY: Bridge_ID
FOREIGN KEY: Bridge_ID REFERENCES Bridge(Bridge_ID)
Tradeoffs:
- Single Table: Simple queries for common attributes, but many NULLs and wasted space
- Table Per Subclass: No NULLs, but queries for common attributes require UNIONs
- Table Per Hierarchy: Balanced approach, but requires JOINs to get complete entity
6.2.7.5 Complete Example: Bridge Monitoring System
Let’s translate the complete ER diagram from Module 4:
Entity Sets:
- Bridge (Bridge_ID, Name, Location, Year_Built, Materials)
- Engineer (Employee_ID, Name, Specialty, Hire_Date)
- Sensor (Sensor_ID, Type, Install_Date)
- Manufacturer (Manufacturer_ID, Company_Name)
Relationships:
- Bridge (1) — Monitored_By — (n) Sensor
- Sensor (n) — Produced_By — (1) Manufacturer
- Engineer (n) — Inspects — (m) Bridge (with attributes Inspection_Date, Condition_Rating)
- Engineer (n) — Supervises — (n) Engineer (self-relationship)
Weak Entity:
- Reading (depends on Sensor) with partial key Seq_Number
Complete Relational Schema:
-- Entity sets become tables
Bridge(Bridge_ID, Name, Location, Year_Built, Materials, Length)
PRIMARY KEY: Bridge_ID
Engineer(Employee_ID, Name, Specialty, Hire_Date)
PRIMARY KEY: Employee_ID
Sensor(Sensor_ID, Type, Install_Date, Bridge_ID, Manufacturer_ID)
PRIMARY KEY: Sensor_ID
FOREIGN KEY: Bridge_ID REFERENCES Bridge(Bridge_ID)
FOREIGN KEY: Manufacturer_ID REFERENCES Manufacturer(Manufacturer_ID)
Manufacturer(Manufacturer_ID, Company_Name)
PRIMARY KEY: Manufacturer_ID
-- Weak entity
Reading(Sensor_ID, Seq_Number, Timestamp, Value)
PRIMARY KEY: (Sensor_ID, Seq_Number)
FOREIGN KEY: Sensor_ID REFERENCES Sensor(Sensor_ID)
-- Many-to-many relationships become junction tables
Inspects(Employee_ID, Bridge_ID, Inspection_Date, Condition_Rating)
PRIMARY KEY: (Employee_ID, Bridge_ID, Inspection_Date)
FOREIGN KEY: Employee_ID REFERENCES Engineer(Employee_ID)
FOREIGN KEY: Bridge_ID REFERENCES Bridge(Bridge_ID)
Supervises(Supervisor_ID, Supervisee_ID)
PRIMARY KEY: (Supervisor_ID, Supervisee_ID)
FOREIGN KEY: Supervisor_ID REFERENCES Engineer(Employee_ID)
FOREIGN KEY: Supervisee_ID REFERENCES Engineer(Employee_ID)
Key observations:
- Each entity set became a table
- One-to-many relationships (Bridge-Sensor, Manufacturer-Sensor) are captured by foreign keys in the “many” side
- Many-to-many relationships (Engineer-Bridge Inspects, Engineer-Engineer Supervises) required junction tables
- The weak entity Reading has a composite primary key
- Self-relationships like Supervises are handled like any other many-to-many relationship
This relational schema can now be implemented in any relational DBMS using DDL (as we saw in the Database Creation Process section), populated with data, and queried using relational algebra or SQL.